
SLIDE 1

Understanding Hidden Memories of Recurrent Neural Networks

Yao Ming, Shaozu Cao, Ruixiang Zhang, Zhen Li, Yuanzhe Chen, Yangqiu Song, Huamin Qu.

THE HONG KONG UNIVERSITY OF SCIENCE AND TECHNOLOGY

SLIDE 2

What is a Recurrent Neural Network?

SLIDE 3

Introduction

What is a Recurrent Neural Network (RNN)?

A deep learning model used for: Machine Translation, Speech Recognition, Language Modeling, …

[Figure: a vanilla RNN, with input x(t), hidden state h(t) passed through tanh, and output y(t)]

SLIDE 4

Introduction

What is a Recurrent Neural Network (RNN)?

A vanilla RNN takes an input x(t) and updates its hidden state h(t-1) using:

h(t) = tanh(W h(t-1) + V x(t))

[Figure: a vanilla RNN cell (input x(t), hidden state h(t), tanh, output y(t)) next to a 2-layer RNN with stacked hidden states h1(t) and h2(t), unrolled over a 4-step sequence; labels mark the input, output, and hidden state]
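To make the update rule concrete, here is a minimal NumPy sketch of one vanilla RNN step. The sizes, initialization, and random input sequence are illustrative assumptions, not the models from the talk:

import numpy as np

def rnn_step(h_prev, x, W, V):
    # One vanilla RNN step: h(t) = tanh(W h(t-1) + V x(t))
    return np.tanh(W @ h_prev + V @ x)

rng = np.random.default_rng(0)
n_hidden, n_input = 50, 100                           # illustrative sizes
W = rng.normal(scale=0.1, size=(n_hidden, n_hidden))  # recurrent weights
V = rng.normal(scale=0.1, size=(n_hidden, n_input))   # input weights

h = np.zeros(n_hidden)                    # initial hidden state h(0)
for x in rng.normal(size=(4, n_input)):   # a 4-step input sequence
    h = rnn_step(h, x, W, V)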

SLIDE 5

What has the RNN learned from data?

[Figure: the RNN as a black box between input and output, marked with a "?"]

SLIDE 6

Motivation

What has the RNN learned from data?

A. Map the value of a single hidden unit on data (Karpathy A. et al., 2015)

[Figure: a unit sensitive to position in a line. Many more units have no clear meaning.]

SLIDE 7

Motivation

What has the RNN learned from data?

B. Matrix plots (Li J. et al., 2016)

Each column represents the value of the hidden state vector when reading an input word.

Scalability!

Machine Translation: 4-layer, 1000 units/layer (Sutskever I. et al., 2014)
Language Modeling: 2-layer, 1500 units/layer (Zaremba et al., 2015)

SLIDE 8

Our Solution - RNNVis

SLIDE 9

Our Solution

  • Explaining individual hidden units
  • Bi-graph and co-clustering
  • Sequence evaluation

SLIDE 10

Solution

Explaining an individual hidden unit using its most salient words

Model's response to a word x at step t: the update of the hidden state, Δh(t) = (Δh_j(t)), j = 1, …, n.

How to define salient? A larger |Δh_j(t)| implies that the word x is more salient to unit j.

Since Δh_j(t) can vary given the same word x, we use the expectation E[Δh(t) | x(t) = x], which can be estimated by running the model on the dataset and taking the mean.

SLIDE 11

Solution

Explaining an individual hidden unit using its most salient words

[Figure: response distribution of unit #36, with boxes showing the 25%-75% and 9%-91% ranges. Top 4 positive/negative salient words of unit #36 in an RNN (GRU) trained on Yelp review data.]

SLIDE 12

Solution

Explaining an individual hidden unit using its most salient words

[Figure: distribution of the model's response given the word "he" (an LSTM with 600 units). Units are reordered according to the mean response; boxes show the mean, 25%-75%, and 9%-91% ranges, with highly responsive hidden units annotated.]

SLIDE 13

Solution

Explaining an individual hidden unit using its most salient words

Investigating one unit/word at a time…

Problem: too much user burden! Solution: an overview for easier exploration.

SLIDE 14

Solution

  • Explaining individual hidden units
  • Bi-graph and co-clustering
  • Sequence evaluation

SLIDE 15

Solution

Bi-graph Formulation

[Figure: a bipartite graph linking hidden units to words (he, she, by, can, may)]

SLIDE 16

Solution

Bi-graph Formulation

[Figure: the bipartite graph between hidden units and words (he, she, by, can, may)]

SLIDE 17

Solution

Co-clustering

[Figure: co-clustered bipartite graph of hidden units and words (he, she, by, can, may)]

Algorithm: spectral co-clustering (Dhillon I. S., 2001)
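The slide names Dhillon's spectral co-clustering; a minimal sketch using scikit-learn's SpectralCoclustering is below. The toy response matrix is an assumption, and since the algorithm expects nonnegative input, this sketch clusters on response magnitudes:

import numpy as np
from sklearn.cluster import SpectralCoclustering

# Rows: hidden units; columns: words; entries: expected response E[Δh_j | w].
rng = np.random.default_rng(0)
response = rng.normal(size=(50, 200))     # 50 units x 200 words (toy data)

model = SpectralCoclustering(n_clusters=5, random_state=0)
model.fit(np.abs(response))               # cluster on edge magnitudes

unit_clusters = model.row_labels_         # cluster id for each hidden unit
word_clusters = model.column_labels_      # cluster id for each word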

SLIDE 18

Solution

Co-clustering – Edge Aggregation

[Figure: aggregated edges between hidden-unit clusters and word clusters (he, she, by, can, may)]

Color: sign of the average edge weight
Width: scale of the average edge weight
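A sketch of the aggregation, reusing the response matrix and cluster labels from the co-clustering sketch above: the averaged block weight's sign would drive the edge color and its magnitude the edge width:

import numpy as np

def aggregate_edges(response, unit_clusters, word_clusters, n_clusters):
    # Average edge weight between each hidden-unit cluster and word cluster.
    agg = np.zeros((n_clusters, n_clusters))
    for cu in range(n_clusters):
        for cw in range(n_clusters):
            block = response[np.ix_(unit_clusters == cu, word_clusters == cw)]
            # sign -> edge color, |value| -> edge width
            agg[cu, cw] = block.mean() if block.size else 0.0
    return agg

agg = aggregate_edges(response, unit_clusters, word_clusters, n_clusters=5)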

SLIDE 19

Solution

Co-clustering – Visualization

[Figure: co-cluster visualization of hidden units and words (he, she, by, can, may)]

SLIDE 20

Solution

Co-clustering – Visualization

Hidden-unit clusters are shown as memory chips; word clusters are shown as word clouds.

[Figure: memory chips and word clouds for the words he, she, by, can, may]

Color: each unit's salience to the selected word

SLIDE 21

Solution

  • Explaining individual hidden units
  • Bi-graph and co-clustering
  • Sequence evaluation

SLIDE 22

Solution

Glyph design for evaluating sentences

Each glyph summarizes the dynamics of hidden-unit clusters when reading a word.

Glyph legend:
  • Each bar represents the average scale of the value in a hidden-unit cluster
  • Bar segments distinguish the current value, increased value, and decreased value
  • The ratio of preserved value shows whether more positive or more negative value is preserved
  • Update direction: towards positive or towards negative
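One plausible way to compute the quantities such a glyph encodes, given the hidden state before and after reading a word. The exact definitions here, especially the preserved-value ratio, are my assumptions, not taken from the paper:

import numpy as np

def glyph_stats(h_prev, h_curr, unit_clusters, n_clusters):
    # Summarize each hidden-unit cluster's dynamics for one word.
    stats = []
    for c in range(n_clusters):
        prev, curr = h_prev[unit_clusters == c], h_curr[unit_clusters == c]
        same_sign = np.sign(prev) == np.sign(curr)
        # Value "preserved" by a unit: the smaller magnitude, if the sign held.
        preserved = np.where(same_sign, np.minimum(np.abs(prev), np.abs(curr)), 0.0)
        stats.append({
            "bar_height": np.abs(curr).mean(),  # average scale of the value
            "update": (curr - prev).mean(),     # >0: update towards positive
            "preserved_ratio": preserved.sum() / max(np.abs(prev).sum(), 1e-12),
        })
    return stats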

SLIDE 23

Case Studies

  • How do RNNs handle sentiments?
  • The language of Shakespeare

SLIDE 24

Case Study – Sentiment Analysis

Each unit has two sides.

Single-layer GRU with 50 hidden units (cells), trained on Yelp review data.

SLIDE 25

Case Study – Sentiment Analysis

RNNs can learn to handle context.

Single-layer GRU with 50 hidden units (cells), trained on Yelp review data.

Sentence A: I love the food, though the staff is not helpful
Sentence B: The staff is not helpful, though I love the food

[Figure: glyph sequences for sentences A and B on a negative-positive scale, marking updates towards positive and towards negative]

SLIDE 26

Case Study – Sentiment Analysis

Clues for the problem.

Single-layer GRU with 50 hidden units (cells), trained on Yelp review data.

Problem: the data is not evenly sampled.

SLIDE 27

Case Study – Sentiment Analysis

Visual indicator of the performance.

Single-layer GRUs with 50 hidden units (cells), trained on Yelp review data.

Balanced dataset: accuracy (test) 88.6%
Unbalanced dataset: accuracy (test) 91.9%

SLIDE 28

Case Studies

  • How do RNNs handle sentiments?
  • The language of Shakespeare

SLIDE 29

Case Study – Language Modeling

The language of Shakespeare – A mixture of the old and the new

SLIDE 30

Case Study – Language Modeling

The language of Shakespeare – A mixture of the old and the new

SLIDE 31

Discussion & Future Work

  • Clustering. The quality of co-clustering? Interactive clustering?
  • Glyph-based sentence visualization. Scalability?
  • Text data. How about speech data?
  • RNN models. More advanced RNN-based models like attention models?
SLIDE 32

Thank you!

Contact: Yao Ming, ymingaa@connect.ust.hk Page: www.myaooo.com/rnnvis Code: www.github.com/myaooo/rnnvis

SLIDE 33

Technical Details

Explaining individual hidden units - Decomposition

The output of an RNN at step t is typically a probability distribution:

q_j = softmax(V h(t))_j = exp(v_j^T h(t)) / Σ_k exp(v_k^T h(t))

where V = [v_1, …, v_N]^T, j = 1, 2, …, N, is the output projection matrix.

Assuming h(0) = 0, the numerator of q_j can be decomposed into a telescoping product:

exp(v_j^T h(t)) = exp(v_j^T Σ_{τ=1..t} (h(τ) - h(τ-1))) = Π_{τ=1..t} exp(v_j^T Δh(τ))

Here exp(v_j^T Δh(τ)) is the multiplicative contribution of the input word x(τ); the update of the hidden state, Δh(τ), can be regarded as the model's response to x(τ).

SLIDE 34

Evaluation

Expert Interview

1. Show a tutorial video
2. Explore the tool
3. Compare two models
4. Answer questions
5. Finish a survey

SLIDE 35

Challenges

What are the challenges?

  • 1. The complexity of the model
  • Machine Translation: 4-layer LSTMs, 1000 units/layer (Sutskever I. et al., 2014)
  • Language Modeling: 2-layer LSTMs, 650 or 1500 units/layer (Zaremba et al., 2015)
  • 2. The complexity of the hidden memory
  • Semantic information is distributed in hidden states of an RNN.
  • 3. The complexity of the data
  • Patterns in sequential data like texts are difficult to analyze and interpret.
SLIDE 36

Other Findings

Comparing LSTMs and vanilla RNNs

[Figure. Left (A-C): co-cluster visualization of the last layer of an RNN. Right (D-F): visualization of the cell states of the last layer of an LSTM. Bottom (G, H): the two models' responses to the same word "offer".]

SLIDE 37

Contribution

  • A visual technique for understanding what RNNs have learned.
  • A visual analytics (VA) tool that reveals the hidden dynamics of a trained RNN.
  • Interesting findings with RNN models.