slide-1
SLIDE 1

CS11-747 Neural Networks for NLP

Convolutional Networks for Text

Pengfei Liu

Site https://phontron.com/class/nn4nlp2020/

With some slides by Graham Neubig

slide-2
SLIDE 2

Outline

  • 1. Feature Combinations
  • 2. CNNs and Key Concepts
  • 3. Case Study on Sentiment Classification
  • 4. CNN Variants and Applications
  • 5. Structured CNNs
  • 6. Summary
slide-3
SLIDE 3

An Example Prediction Problem: Sentiment Classification

[Figure: two example sentences, “I hate this movie” and “I love this movie”, each to be mapped to one of the labels very good / good / neutral / bad / very bad]

slide-4
SLIDE 4

An Example Prediction Problem: Sentiment Classification

[Figure repeated: the two example sentences and the five sentiment labels]

slide-5
SLIDE 5

An Example Prediction Problem: Sentiment Classification

[Figure repeated: the two example sentences and the five sentiment labels]

How does our machine do this task?

slide-6
SLIDE 6

Continuous Bag of Words (CBOW)

[Figure: each word of “I hate this movie” is passed through a lookup table, the vectors are combined, and a weight matrix W plus a bias gives the scores]

  • One of the simplest methods
  • Discrete symbols to continuous vectors

slide-7
SLIDE 7

Continuous Bag of Words (CBOW)

[Figure: as before, the word vectors are looked up and combined, and W plus a bias gives the scores]

  • One of the simplest methods
  • Discrete symbols to continuous vectors
  • Average all vectors (sketched below)
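A minimal PyTorch sketch of the CBOW classifier just described (the vocabulary size, embedding dimension, label set, and word ids are placeholder values, not from the slides):

import torch
import torch.nn as nn

class CBOWClassifier(nn.Module):
    """Look up a vector for every word, average them, and score the labels."""
    def __init__(self, vocab_size=10000, emb_dim=64, num_labels=5):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)   # lookup table
        self.out = nn.Linear(emb_dim, num_labels)        # W and bias

    def forward(self, word_ids):                  # word_ids: (seq_len,)
        vectors = self.embed(word_ids)            # (seq_len, emb_dim)
        pooled = vectors.mean(dim=0)              # average all word vectors
        return self.out(pooled)                   # scores over the labels

scores = CBOWClassifier()(torch.tensor([4, 17, 8, 3]))   # e.g. "I hate this movie" as ids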
slide-8
SLIDE 8

Deep CBOW

[Figure: the word vectors are combined, then passed through stacked nonlinear layers tanh(W1*h + b1) and tanh(W2*h + b2) before computing the scores]

  • More linear transformations followed by activation functions (Multilayer Perceptron, MLP); see the sketch below
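Continuing the sketch above, the deep variant only adds tanh layers (an MLP) on top of the combined vector; the sizes are again placeholders:

import torch
import torch.nn as nn

class DeepCBOW(nn.Module):
    def __init__(self, vocab_size=10000, emb_dim=64, hid_dim=64, num_labels=5):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.mlp = nn.Sequential(                  # tanh(W1*h + b1), then tanh(W2*h + b2)
            nn.Linear(emb_dim, hid_dim), nn.Tanh(),
            nn.Linear(hid_dim, hid_dim), nn.Tanh(),
        )
        self.out = nn.Linear(hid_dim, num_labels)

    def forward(self, word_ids):
        h = self.embed(word_ids).sum(dim=0)        # combine the word vectors
        return self.out(self.mlp(h))               # scores

scores = DeepCBOW()(torch.tensor([4, 23, 8, 3]))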

slide-9
SLIDE 9

What’s the Use of the “Deep”?

  • Multiple MLP layers allow us to easily learn feature combinations (a node in the second layer might be “feature 1 AND feature 5 are active”)
  • e.g. capture things such as “not” AND “hate”
  • BUT! Cannot handle “not hate”
slide-10
SLIDE 10

Handling Combinations

slide-11
SLIDE 11

Bag of n-grams

[Figure: n-gram vectors from “I hate this movie” are summed with a bias to give the scores, and a softmax gives the probabilities]

  • A contiguous sequence of words
  • Concatenate word vectors
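A rough sketch of the bag-of-n-grams model above; hashing the n-grams into a fixed number of buckets is a simplification of mine, not something the slides specify:

import torch
import torch.nn as nn

class BagOfNgrams(nn.Module):
    def __init__(self, num_buckets=100000, emb_dim=64, num_labels=5, n=2):
        super().__init__()
        self.n = n
        self.embed = nn.Embedding(num_buckets, emb_dim)   # one vector per n-gram bucket
        self.out = nn.Linear(emb_dim, num_labels)

    def forward(self, words):                             # words: list of strings
        ngrams = [" ".join(words[i:i + self.n]) for i in range(len(words) - self.n + 1)]
        ids = torch.tensor([hash(g) % self.embed.num_embeddings for g in ngrams])
        return self.out(self.embed(ids).sum(dim=0))       # sum n-gram vectors, then score

probs = torch.softmax(BagOfNgrams()(["I", "hate", "this", "movie"]), dim=-1)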
slide-12
SLIDE 12

Why Bag of n-grams?

  • Allows us to capture combination features in a simple way, e.g. “don’t love”, “not the best”
  • Decent baseline that works pretty well

slide-13
SLIDE 13

What Problems w/ Bag of n-grams?

  • Same as before: parameter explosion
  • No sharing between similar words/n-grams
  • Lose the global sequence order
slide-14
SLIDE 14

What Problems w/ Bag of n-grams?

  • Same as before: parameter explosion
  • No sharing between similar words/n-grams
  • Lose the global sequence order

Other solutions?

slide-15
SLIDE 15

Neural Sequence Models

slide-16
SLIDE 16

Neural Sequence Models

Most NLP tasks → sequence representation learning problems

slide-17
SLIDE 17

Neural Sequence Models

char: i-m-p-o-s-s-i-b-l-e word: I-love-this-movie

slide-18
SLIDE 18

Neural Sequence Models

CBOW Bag of n-grams CNNs RNNs Transformer GraphNNs

slide-19
SLIDE 19

Neural Sequence Models

CBOW Bag of n-grams

CNNs

RNNs Transformer GraphNNs

slide-20
SLIDE 20

Convolutional Neural Networks

slide-21
SLIDE 21

Convolution → a mathematical operation

  • Continuous
  • Discrete

Definition of Convolution

slide-22
SLIDE 22

Convolution → a mathematical operation

  • Continuous
  • Discrete

Definition of Convolution
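The formulas on these slides do not survive extraction; for reference, the standard definitions are:

(f * g)(t) = \int_{-\infty}^{\infty} f(\tau)\, g(t - \tau)\, d\tau        (continuous)

(f * g)[n] = \sum_{m=-\infty}^{\infty} f[m]\, g[n - m]                    (discrete)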

slide-23
SLIDE 23

Intuitive Understanding

  • Input: feature vector
  • Filter: learnable parameters
  • Output: hidden vector

slide-24
SLIDE 24

Priors Entailed by CNNs

slide-25
SLIDE 25

Priors Entailed by CNNs

Local bias:

Different words could interact with their neighbors

slide-26
SLIDE 26

Priors Entailed by CNNs

Different words could interact with their neighbors

Local bias:

slide-27
SLIDE 27

Priors Entailed by CNNs

Parameter sharing:

The parameters of the composition function are the same.

slide-28
SLIDE 28

Basics of CNNs

slide-29
SLIDE 29

Concept: 2d Convolution

  • Deals with 2-dimensional signals, e.g., images
slide-30
SLIDE 30

Concept: 2d Convolution

slide-31
SLIDE 31

Concept: 2d Convolution

slide-32
SLIDE 32

Concept: Stride

Stride: the number of units the filter shifts over the input matrix.

slide-33
SLIDE 33

Concept: Stride

Stride: the number of units the filter shifts over the input matrix.

slide-34
SLIDE 34

Concept: Stride

Stride: the number of units the filter shifts over the input matrix.
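A small check of how the stride affects the output length, using a length-7 input and a width-3 filter (the concrete numbers are chosen here for illustration, not taken from the slide):

import torch
import torch.nn as nn

x = torch.randn(1, 1, 7)                            # (batch, channels, length=7)
conv_s1 = nn.Conv1d(1, 1, kernel_size=3, stride=1)  # shifts one unit at a time
conv_s2 = nn.Conv1d(1, 1, kernel_size=3, stride=2)  # shifts two units at a time
print(conv_s1(x).shape)                             # torch.Size([1, 1, 5])
print(conv_s2(x).shape)                             # torch.Size([1, 1, 3])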

slide-35
SLIDE 35

Concept: Padding

Padding: dealing with the units at the boundary of the input vector.
slide-36
SLIDE 36

Concept: Padding

Padding: dealing with the units at the boundary of the input vector.
slide-37
SLIDE 37

Three Types of Convolutions

Narrow

m=7 n=3 m-n+1=5

slide-38
SLIDE 38

Three Types of Convolutions

Narrow: m=7, n=3, output m-n+1=5
Equal: m=7, n=3, output m=7

slide-39
SLIDE 39

Three Types of Convolutions

Narrow

m=7 n=3 m-n+1=5

slide-40
SLIDE 40

Three Types of Convolutions

Narrow: m=7, n=3, output m-n+1=5
Equal: m=7, n=3, output m=7

slide-41
SLIDE 41

Three Types of Convolutions

Narrow: m=7, n=3, output m-n+1=5
Equal: m=7, n=3, output m=7
Wide: m=7, n=3, output m+n-1=9
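With m = 7 and n = 3 the three cases differ only in the amount of zero-padding; a quick check in PyTorch (the padding amounts are the standard choices, not stated explicitly on the slide):

import torch
import torch.nn as nn

x = torch.randn(1, 1, 7)                             # m = 7
narrow = nn.Conv1d(1, 1, kernel_size=3, padding=0)   # output m - n + 1 = 5
equal  = nn.Conv1d(1, 1, kernel_size=3, padding=1)   # output m = 7
wide   = nn.Conv1d(1, 1, kernel_size=3, padding=2)   # output m + n - 1 = 9
print(narrow(x).shape, equal(x).shape, wide(x).shape)
# torch.Size([1, 1, 5]) torch.Size([1, 1, 7]) torch.Size([1, 1, 9])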

slide-42
SLIDE 42

Concept: Multiple Filters

Motivation: each filter represents a unique feature of the convolution window.

slide-43
SLIDE 43

Concept: Pooling

  • Pooling is an aggregation operation, aiming to select informative features

slide-44
SLIDE 44

Concept: Pooling

  • Pooling is an aggregation operation, aiming to select informative features (sketched in code below)
  • Max pooling: “Did you see this feature anywhere in the range?” (most common)
  • Average pooling: “How prevalent is this feature over the entire range?”
  • k-Max pooling: “Did you see this feature up to k times?”
  • Dynamic pooling: “Did you see this feature in the beginning? In the middle? In the end?”
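Minimal sketches of these pooling operations over a matrix of filter activations (rows are positions, columns are filters); note that torch.topk sorts by value, whereas true k-max pooling keeps the original sequence order:

import torch

feats = torch.randn(7, 4)                     # 7 positions, 4 filters

max_pooled  = feats.max(dim=0).values         # "did this feature fire anywhere?"
mean_pooled = feats.mean(dim=0)               # "how prevalent is this feature?"
kmax_pooled = feats.topk(2, dim=0).values     # strongest k=2 activations per filter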

slide-45
SLIDE 45

Concept: Pooling

Max pooling: [illustrated in figure]

slide-46
SLIDE 46

Concept: Pooling

Max pooling, mean pooling: [illustrated in figure]

slide-47
SLIDE 47

Concept: Pooling

Max pooling, mean pooling, k-max pooling: [illustrated in figure]

slide-48
SLIDE 48

Concept: Pooling

Max pooling, mean pooling, k-max pooling, dynamic pooling: [illustrated in figure]

slide-49
SLIDE 49

Case Study: Convolutional Networks for Text Classification (Kim 2014)

slide-50
SLIDE 50

CNNs for Text Classification (Kim 2014)

  • Task: sentiment classification
  • Input: a sentence
  • Output: a class label (positive/negative)
slide-51
SLIDE 51

CNNs for Text Classification (Kim 2014)

  • Task: sentiment classification
  • Input: a sentence
  • Output: a class label (positive/negative)
  • Model:
  • Embedding layer
  • Multi-Channel CNN layer
  • Pooling layer/Output layer
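A compact sketch of this kind of model (a single input channel for brevity; the filter widths, filter counts, and dimensions are illustrative rather than Kim's exact settings):

import torch
import torch.nn as nn

class TextCNN(nn.Module):
    def __init__(self, vocab_size=10000, emb_dim=100, num_filters=4,
                 widths=(3, 4, 5), num_labels=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)                 # embedding layer
        self.convs = nn.ModuleList(
            [nn.Conv1d(emb_dim, num_filters, kernel_size=w) for w in widths])
        self.dropout = nn.Dropout(0.5)
        self.out = nn.Linear(num_filters * len(widths), num_labels)

    def forward(self, word_ids):                      # (batch, seq_len)
        x = self.embed(word_ids).transpose(1, 2)      # (batch, emb_dim, seq_len)
        pooled = [torch.relu(conv(x)).max(dim=2).values for conv in self.convs]
        h = torch.cat(pooled, dim=1)                  # concatenate max-pooled features
        return self.out(self.dropout(h))              # dropout, then scores

logits = TextCNN()(torch.randint(0, 10000, (1, 12)))  # one 12-word sentence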
slide-52
SLIDE 52

Overview of the Architecture

[Figure: architecture overview: input, look-up dict, CNN with filters, pooling, output]

slide-53
SLIDE 53

Embedding Layer

[Figure: input words mapped through a look-up table]

  • Build a look-up table (pre-trained? fine-tuned?)
  • Discrete → distributed
slide-54
SLIDE 54
  • Conv. Layer
slide-55
SLIDE 55
  • Conv. Layer
  • Stride size?
slide-56
SLIDE 56
  • Conv. Layer
  • Stride size?
  • 1
slide-57
SLIDE 57
  • Conv. Layer
  • Wide, equal, narrow?
slide-58
SLIDE 58
  • Conv. Layer
  • Wide, equal, narrow?
  • narrow
slide-59
SLIDE 59
  • Conv. Layer
  • How many filters?
slide-60
SLIDE 60
  • Conv. Layer
  • How many filters?
  • 4
slide-61
SLIDE 61

Pooling Layer

  • Max-pooling
  • Concatenate
slide-62
SLIDE 62

Output Layer

  • MLP layer
  • Dropout
  • Softmax
slide-63
SLIDE 63

CNN Variants

slide-64
SLIDE 64

Priors Entailed by CNNs

  • Local bias
  • Parameter sharing
slide-65
SLIDE 65

Priors Entailed by CNNs

  • Local bias
  • Parameter sharing

How to handle long-term dependencies? How to handle different types of compositionality?
slide-66
SLIDE 66

Priors Entailed by CNNs

[Table: each prior, its advantage, and its limitation]

slide-67
SLIDE 67

CNN Variants

  • Long-term dependency
  • increase receptive fields (dilated)
  • Complicated Interaction
  • dynamic filters

[Figure: the variants organized by locality bias and parameter sharing]

slide-68
SLIDE 68

Dilated Convolution

(e.g. Kalchbrenner et al. 2016)

[Figure: dilated convolution over the characters “i _ h a t e _ t h i s _ f i l m”, used to predict a sentence class (classification), the next char (language modeling), or a word class (tagging)]

  • Long-term dependencies with fewer layers
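A small illustration of how stacking dilated convolutions grows the receptive field while keeping the number of layers small (channel counts and lengths are arbitrary here):

import torch
import torch.nn as nn

x = torch.randn(1, 8, 32)                                   # (batch, channels, length)
layers = nn.Sequential(
    nn.Conv1d(8, 8, kernel_size=3, dilation=1, padding=1),  # receptive field 3
    nn.Conv1d(8, 8, kernel_size=3, dilation=2, padding=2),  # receptive field 7
    nn.Conv1d(8, 8, kernel_size=3, dilation=4, padding=4),  # receptive field 15
)
print(layers(x).shape)                                      # length preserved: (1, 8, 32)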
slide-69
SLIDE 69

Dynamic Filter CNN (e.g. Brabandere et al. 2016)

  • In a standard CNN, the parameters of the filters are static, failing to capture rich interaction patterns.
  • Here, filters are generated dynamically, conditioned on an input.
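A rough sketch of the dynamic-filter idea (not Brabandere et al.'s exact formulation): a small network predicts the filter weights from a summary of the input, and torch.nn.functional.conv1d applies them:

import torch
import torch.nn as nn
import torch.nn.functional as F

class DynamicFilterConv(nn.Module):
    def __init__(self, channels=16, width=3):
        super().__init__()
        self.channels, self.width = channels, width
        # generate a filter bank per input, conditioned on its mean vector
        self.filter_gen = nn.Linear(channels, channels * channels * width)

    def forward(self, x):                               # x: (channels, length)
        context = x.mean(dim=1)                         # summary of this input
        filters = self.filter_gen(context).view(
            self.channels, self.channels, self.width)   # (out_ch, in_ch, width)
        return F.conv1d(x.unsqueeze(0), filters, padding=self.width // 2)[0]

y = DynamicFilterConv()(torch.randn(16, 10))            # filters differ per input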

slide-70
SLIDE 70

Common Applications

slide-71
SLIDE 71

CNN Applications

  • Word-level CNNs
  • Basic unit: word
  • Learn the representation of a sentence
  • Phrasal patterns
  • Char-level CNNs
  • Basic unit: character
  • Learn the representation of a word
  • Extract morphological patterns
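A sketch of a char-level CNN that builds a word representation from its characters (a common recipe; the dimensions are placeholders):

import torch
import torch.nn as nn

class CharCNNWordEncoder(nn.Module):
    """Embed each character, convolve over the character sequence, max-pool."""
    def __init__(self, num_chars=128, char_dim=16, num_filters=32, width=3):
        super().__init__()
        self.embed = nn.Embedding(num_chars, char_dim)
        self.conv = nn.Conv1d(char_dim, num_filters, kernel_size=width, padding=1)

    def forward(self, char_ids):                   # char_ids: (word_len,)
        x = self.embed(char_ids).t().unsqueeze(0)  # (1, char_dim, word_len)
        return torch.relu(self.conv(x)).max(dim=2).values.squeeze(0)   # word vector

word_vec = CharCNNWordEncoder()(torch.tensor([ord(c) for c in "impossible"]))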
slide-72
SLIDE 72

CNN Applications

  • Word-level CNN
  • Sentence representation
slide-73
SLIDE 73

NLP (Almost) from Scratch

(Collobert et al. 2011)

  • One of the most important papers in NLP
  • Proposed as early as 2008
slide-74
SLIDE 74

CNN Applications

  • Word-level CNN
  • Sentence representation
  • Char-level CNN
  • Text Classification
slide-75
SLIDE 75

CNN-RNN-CRF for Tagging

(Ma et al. 2016)

  • A classic framework and de facto standard for tagging
  • Char-CNN is used to learn word representations (extract morphological information).
  • Complementarity
slide-76
SLIDE 76

Structured Convolution

slide-77
SLIDE 77

Why Structured Convolution?

The man ate the egg.

slide-78
SLIDE 78

Why Structured Convolution?

The man ate the egg.

[Figure: vanilla CNNs applied to the sentence]

slide-79
SLIDE 79

Why Structured Convolution?

The man ate the egg.

  • Some convolutional operations are not necessary
  • e.g. noun-verb pairs are very informative, but not captured by normal CNNs

[Figure: vanilla CNNs applied to the sentence]

slide-80
SLIDE 80

Why Structured Convolution?

The man ate the egg.

  • Some convolutional operations are not necessary
  • e.g. noun-verb pairs are very informative, but not captured by normal CNNs
  • Language has structure; we would like to use it to localize features

slide-81
SLIDE 81

Why Structured Convolution?

The man ate the egg.

  • Some convolutional operations are not necessary
  • e.g. noun-verb pairs are very informative, but not captured by normal CNNs
  • Language has structure; we would like to use it to localize features

The “structure” provides a stronger prior!

slide-82
SLIDE 82

Tree-structured Convolution

(Mou et al. 2014, Ma et al. 2015)

  • Convolve over parents, grandparents, siblings
slide-83
SLIDE 83

Graph Convolution

(e.g. Marcheggiani et al. 2017)

  • Convolution is shaped by the graph structure
  • For example, a dependency tree is a graph with 1) self-loop connections, 2) dependency connections, 3) reverse connections
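A minimal sketch of one graph-convolution layer over such a graph (a generic GCN-style update, not Marcheggiani & Titov's exact gated formulation): each word averages transformed vectors from its neighbors, including itself via the self-loop.

import torch
import torch.nn as nn

class GraphConvLayer(nn.Module):
    def __init__(self, dim=64):
        super().__init__()
        self.linear = nn.Linear(dim, dim)

    def forward(self, h, adj):
        # h: (num_words, dim); adj: (num_words, num_words) with self-loops,
        # dependency edges, and reverse edges all set to 1
        deg = adj.sum(dim=1, keepdim=True)                # number of neighbors
        return torch.relu(adj @ self.linear(h) / deg)     # average over neighbors

num_words, dim = 5, 64
adj = torch.eye(num_words)                                # self-loop connections
adj[2, 1] = adj[1, 2] = 1.0                               # a dependency edge and its reverse
h1 = GraphConvLayer(dim)(torch.randn(num_words, dim), adj)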

slide-84
SLIDE 84

Summary

slide-85
SLIDE 85

Neural Sequence Models

CBOW Bag of n-grams CNNs RNNs Transformer GraphNNs

slide-86
SLIDE 86

Neural Sequence Models

How do we choose among the different neural sequence models?

slide-87
SLIDE 87

Understand the design philosophy of a model

  • Inductive bias: the set of assumptions that the learner uses to predict outputs given inputs that it has not encountered (from Wikipedia)
  • Structural bias: a set of prior knowledge incorporated into your model design

slide-88
SLIDE 88

Structural Bias

  • Structural bias: a set of prior knowledge incorporated into your model design
  • Locality: local vs. non-local
slide-89
SLIDE 89

Structural Bias

  • Structural bias: a set of prior knowledge incorporated into your model design
  • Topological structure: sequential, tree, graph

slide-90
SLIDE 90

What inductive bias does a neural component entail?

[Table: locality bias (local / non-local) vs. topological structure (seq. / tree / graph)]

slide-91
SLIDE 91

What inductive bias does a neural component entail?

[Table: locality bias (local / non-local) vs. topological structure (seq. / tree / graph), with CNN and RNN placed in their cells]

slide-92
SLIDE 92

What inductive bias does a neural component entail?

[Table: locality bias (local / non-local) vs. topological structure (seq. / tree / graph), with structured CNNs placed in their cell]

slide-93
SLIDE 93

What inductive bias does a neural component entail?

[Table: locality bias (local / non-local) vs. topological structure (seq. / tree / graph), with a question mark for the remaining cell]

slide-94
SLIDE 94

What inductive bias does a neural component entail?

[Table: locality bias (local / non-local) vs. topological structure (seq. / tree / graph), with a question mark for the remaining cell]

slide-95
SLIDE 95

What inductive bias does a neural component entail?

[Table: locality bias (local / non-local) vs. topological structure (seq. / tree / graph), with a question mark for the remaining cell]

slide-96
SLIDE 96

Questions?