Text and Image Synergy with Feature Cross Technique for Gender Identification


SLIDE 1

Text and Image Synergy with Feature Cross Technique for Gender Identification

CLEF/PAN 2018 Author Profiling Task

September 10, 2018 Takumi Takahashi, Takuji Tahara, Koki Nagatani, Yasuhide Miura, Tomoki Taniguchi, and Tomoko Ohkuma

Fuji Xerox Co., Ltd.

SLIDE 2

Outline

・Introduction
・PAN 2018 Author Profiling Task
・Related Work
・Our Motivation
・Proposed Model
・Experiment
・Result
・Discussion
・Conclusion & Future Works

SLIDE 3

1. Introduction

■ Author profile traits on social media:

・Author profile traits can be applied to various applications

  • Traits: age, gender, location, …
  • Applications: advertisement, recommendation, marketing, etc.


■ Issues:

・Author profile traits are not explicitly described on social media.

  • This makes it difficult to utilize author profile traits in applications


SLIDE 4

2. PAN 2018 Author Profiling Task

■ Gender identification from Tweets:

・Gender identification:

  • Binary classification from Tweets (male/female)

・Target languages:

  • Arabic, English, Spanish

・Datasets:

  • Text data contains 100 Tweets for each user
  • Image data contains 10 images for each user

           Users    Tweets    Images
Arabic     1,500    150,000   15,000
English    3,000    300,000   30,000
Spanish    3,000    300,000   30,000

New dataset in PAN 2018

TWITTER, TWEET, RETWEET and the Twitter logo are trademarks of Twitter, Inc. or its affiliates.

SLIDE 5

3. Related Work (1)

■ Strong models at PAN 2017:

・Traditional machine learning approaches performed well

  • Linear SVM with character 3- to 5-grams and word 1- to 2-grams features

(Basile et al., 2017)

  • Exploring many approaches and employing logistic regression

(Martinc et al., 2017)

  • Micro TC: generic framework for text classification (Tellez et al., 2017)

■ Deep Neural Network approaches at PAN 2017:

・DNN approaches were also presented

  • Bi-directional GRU with attention for word + CNN for character

(Miura et al., 2017)

  • CNN with convolutional filters of different sizes (Sierra et al., 2017)

In PAN 2017, DNN could not outperform traditional ML models

SLIDE 6

3. Related Work (2)

■ Author profiling tasks outside of PAN:

・Combining texts and images in a neural network

  • Predicting users' traits (gender, age, political orientation, and location)
  • A model that utilized both texts and images achieved state-of-the-art performance (Vijayaraghavan et al., 2017)

■ Expectation:

・Utilizing not only texts but also images would be effective for author profiling

SLIDE 7

4. Our Motivation

■ Deep Neural Network (DNN):

・In PAN 2017, a DNN approach ranked 4th (Miura et al., 2017)

■ Unveiling images:

・PAN 2018 introduced images for identifying users' gender

  • 10 images are prepared for each user
  • Many successful models exist in CV tasks (AlexNet, VGG16, ResNet)

■ Main approaches at PAN 2017:

・Traditional machine learning approaches performed well

  • SVM, Random Forest, Logistic Regression, …
  • Uni-gram, Bi-gram features were often employed

Performance should be enhanced by combining texts with images in a DNN

SLIDE 8

5. Proposed Model

■ Core idea:

・Leverage the synergy of texts and images with a feature cross technique in a neural network
・The relationship between the two features is computed by a direct product

■ Major components

The model, Text Image Fusion Neural Network (TIFNN), is constructed of three components:
  1. Text Component
  2. Image Component
  3. Fusion Component

[Architecture diagram: words → Word Embedding → RNN_W → Pooling_W → Pooling_T → FC1_UT (Text Component); images → CNN_I → FC_UI → Pooling_I (Image Component); both representations enter the Fusion Component: direct product → Column-wise / Row-wise Pooling → FC1 → FC2 → label]

→ Inspired by (Santos et al., 2016) for QA

SLIDE 9

5-1. Text Component

■ Purpose of the component:

・Encoding a text representation from the user's Tweets
・Integrating the 100 Tweets of each user into one representation

■ Model composition:

・RNN_W: the layer is constructed of a bi-directional GRU; it handles a sentence word by word (one word per time step)
・Pooling_W: integrates the words in a Tweet (word-level pooling)
・Pooling_T: integrates the Tweets of a user (Tweet-level pooling)
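The two-level pooling can be sketched in plain Python. This is a minimal illustration, not the actual model: the bi-directional GRU encoder is omitted (words are represented directly by their embedding vectors), and mean pooling stands in for the deck's pooling operators.

```python
def mean_pool(vectors):
    """Average a list of equal-length vectors element-wise."""
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]

def encode_user(tweets):
    """tweets: list of Tweets, each a list of word vectors.
    Pooling_W: pool the words -> one vector per Tweet.
    Pooling_T: pool the Tweet vectors -> one vector per user."""
    tweet_vecs = [mean_pool(words) for words in tweets]   # Pooling_W
    return mean_pool(tweet_vecs)                          # Pooling_T

# Toy user with 2 Tweets of 2-dimensional word vectors
user = [[[1.0, 0.0], [3.0, 2.0]],   # Tweet 1 pools to [2.0, 1.0]
        [[0.0, 4.0]]]               # Tweet 2 pools to [0.0, 4.0]
print(encode_user(user))            # -> [1.0, 2.5]
```

In the real model each of the 100 Tweets is first encoded by the bi-GRU before the same two pooling steps are applied.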

SLIDE 10

[Diagram: each of the 10 images passes through VGG16 (Conv. Layers 1-5 with Pool1-5, then FC6 and FC7); FC_UI projects each per-image feature, and Pooling_I averages over the 10 images into one image representation]

5-2. Image Component

■ Purpose of the component:

・Encoding an image representation for each user
・Integrating the 10 images of each user into one representation

■ Model composition:

・CNN_I: 13 convolutional layers, 5 pooling layers, and 2 fully connected layers (VGG16)
・Pooling_I: integrates the 10 images of a user
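The last two steps of the component can be sketched with stdlib Python only. The VGG16 feature extractor is omitted, and the projection matrix `W` is illustrative, not the trained FC_UI parameters.

```python
def linear(x, W):
    """FC_UI as a plain matrix-vector product (bias omitted for brevity)."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def image_representation(vgg_features, W):
    """vgg_features: one VGG16 feature vector per image of the user."""
    projected = [linear(f, W) for f in vgg_features]             # FC_UI
    n, d = len(projected), len(projected[0])
    # Pooling_I: element-wise average over the user's images
    return [sum(p[i] for p in projected) / n for i in range(d)]

W = [[1.0, 0.0], [0.0, 2.0]]          # toy 2x2 projection
feats = [[1.0, 1.0], [3.0, 3.0]]      # two images instead of 10
print(image_representation(feats, W)) # -> [2.0, 4.0]
```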

SLIDE 11

[Diagram: the Text and Image Component outputs enter the Fusion Component: direct product → Column-wise Pooling / Row-wise Pooling → FC1 → FC2 → label]

5-3. Fusion Component

■ Purpose of the component:

・Leveraging the synergy of texts and images by the feature cross technique
・Finally, the model classifies the user's gender using the combined feature

■ Model composition:

・Direct product: captures the relationship between texts and images

    H = s_txt ⊗ s_img

・Column-wise pooling: finds the most relevant image element with respect to the text representation; row-wise pooling does the same for the image representation

    [h_txt]_i = max_{1≤j≤M} H[i, j]
    [h_img]_j = max_{1≤i≤N} H[i, j]
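The feature cross admits a minimal plain-Python sketch, assuming the two pooled vectors are concatenated before the FC layers (the slide does not spell out this last step):

```python
def feature_cross(s_txt, s_img):
    """s_txt, s_img: pooled text and image representations."""
    # Direct product: H[i][j] = s_txt[i] * s_img[j]
    H = [[t * m for m in s_img] for t in s_txt]
    # Column-wise pooling: best-matching image element per text element
    h_txt = [max(row) for row in H]
    # Row-wise pooling: best-matching text element per image element
    h_img = [max(H[i][j] for i in range(len(s_txt))) for j in range(len(s_img))]
    return h_txt + h_img   # combined feature fed to FC1/FC2

s_txt = [1.0, -2.0]
s_img = [3.0, 0.5]
print(feature_cross(s_txt, s_img))  # -> [3.0, -1.0, 3.0, 0.5]
```

Because every cross term s_txt[i] * s_img[j] is computed, the max-pooled outputs highlight which text and image elements interact most strongly.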

SLIDE 12

6. Experiment

■ Dataset:

・PAN 2018 Author Profiling Task Corpus:

  • Divided the corpus into train8, dev1, and test1 (8:1:1) with a gender ratio of 1:1

           train8   dev1   test1   Full size
Arabic      1,200    150     150       1,500
English     2,400    300     300       3,000
Spanish     2,400    300     300       3,000
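One way to realize such a split, sketched with the stdlib only (the deck does not describe the authors' actual splitting code, so `stratified_split` is a hypothetical helper): split each gender separately in an 8:1:1 ratio, so every part keeps the 1:1 gender ratio.

```python
import random

def stratified_split(users, ratios=(8, 1, 1), seed=0):
    """users: list of (user_id, gender) pairs -> (train8, dev1, test1)."""
    rng = random.Random(seed)
    parts = [[], [], []]
    total = sum(ratios)
    for gender in {g for _, g in users}:
        group = [u for u in users if u[1] == gender]
        rng.shuffle(group)
        n1 = len(group) * ratios[0] // total
        n2 = len(group) * (ratios[0] + ratios[1]) // total
        parts[0] += group[:n1]      # train8
        parts[1] += group[n1:n2]    # dev1
        parts[2] += group[n2:]      # test1
    return parts

# English-sized toy corpus: 1,500 users per gender
users = [(i, "male") for i in range(1500)] + \
        [(i, "female") for i in range(1500, 3000)]
train8, dev1, test1 = stratified_split(users)
print(len(train8), len(dev1), len(test1))  # -> 2400 300 300
```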

■ Streaming Tweets:

・Collected Tweets from Twitter via the Twitter Streaming API to pre-train the word embedding matrix

  • During the period of March to May 2017
  • Removed Retweets
  • Deleted Tweets posted by bots

           # of Tweets
Arabic         2.46M
English       10.72M
Spanish        3.17M


SLIDE 13

6-1. Training Procedures (1)

■ Pre-training word embeddings & VGG16:

・Initialization of word embeddings:

  • Utilized fastText with the skip-gram algorithm to pre-train the word embeddings (Bojanowski et al., 2016)

・Initialization of CNN_I:

  • CNN_I is initialized with the parameters of VGG16 pre-trained on ImageNet

[Diagram: the Text and Image Components, with the word embeddings and CNN_I highlighted as pre-trained]

SLIDE 14

6-1. Training Procedures (2)

■ Component-wise training:

・Text component:

  • Text component is trained using train8 and dev1

・Image component:

  • Image component is trained using train8 and dev1

NOTE: each component is trained without the fusion component!

[Diagram: the Text and Image Components, each trained separately]

SLIDE 15

6-1. Training Procedures (3)

■ TIFNN training:

・All TIFNN parameters except the final FC layers are initialized with the parameters of the pre-trained components
  → The entire model is then fine-tuned using train8 and dev1

[Diagram: the full TIFNN (Text, Image, and Fusion Components); fine-tuning is applied to the entire model]

SLIDE 16

6-2. Comparison Models

・SVM: SVM using TF-IDF uni-gram features; a strong baseline
・Text NN: Text component plus a fully connected layer
・Image NN: Image component only
・Text NN + Image NN: combines both NNs without the fusion component
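For reference, the TF-IDF uni-gram features behind the SVM baseline can be computed with the stdlib alone. In practice a library such as scikit-learn would be used; the smoothing-free idf = log(N/df) here is one common variant, not necessarily the authors' exact weighting.

```python
import math
from collections import Counter

def tfidf(docs):
    """docs: list of whitespace-tokenized strings -> one {word: weight} per doc."""
    N = len(docs)
    df = Counter()                       # document frequency per uni-gram
    for doc in docs:
        df.update(set(doc.split()))
    vecs = []
    for doc in docs:
        tf = Counter(doc.split())        # term frequency in this doc
        vecs.append({w: tf[w] * math.log(N / df[w]) for w in tf})
    return vecs

docs = ["great game today", "great movie tonight"]
# "great" appears in every doc, so its idf (and weight) is 0
print(tfidf(docs)[0])
```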

[Diagrams: Text NN, Image NN, Text NN + Image NN, and TIFNN architectures]

SLIDE 17

7. Result (In-house Experiment)

■ In-house experiment:

・Text NN and Image NN achieved accuracies of 80.0-82.3%
・TIFNN drastically improved the accuracies: +2.7-8.6pt

  • Significantly improved for English

・TIFNN also outperformed Text NN + Image NN

[Chart: accuracies of SVM, Text NN, Image NN, Text NN + Image NN, and TIFNN for Arabic, English, and Spanish (y-axis 0.65-0.95); TIFNN gains up to +8.6pt]

SLIDE 18

[Chart: submission-run accuracies of Text NN, Image NN, TIFNN, and the best participant for each language]

7. Result (Submission Run)

■ Submission run:

・TIFNN had better accuracies than the individual models (+1.3-6.1pt)

  • The model had lower accuracies than in the in-house experiment
    → perhaps overfitting

・Image NN significantly outperformed the other systems
・Ranked 1st among all participants


SLIDE 19
8. Gender Identification by Human (1)

■ Correlation between Human and Image NN:

・Image NN showed superior performance in this task

  • How accurately can humans identify a user's gender from images?
    → Investigating the correlation between humans and Image NN

■ Categorizing target users:

・Target users were divided into 3 categories:

  • group 1: Image NN incorrect (Acc = 0.0)
  • group 2: Image NN correct (Acc = 1.0)
  • group 3: softmax outputs between 0.33 and 0.66 (Acc = 0.5)
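One plausible reading of the grouping rule as code; the deck gives only the three categories above, so the exact thresholding and tie-breaking here are assumptions.

```python
def categorize(p_female, gold):
    """p_female: Image NN softmax output for the 'female' class.
    gold: the user's true gender ('male' or 'female')."""
    if 0.33 <= p_female <= 0.66:       # uncertain prediction -> group 3
        return "group3"
    pred = "female" if p_female > 0.5 else "male"
    return "group2" if pred == gold else "group1"

print(categorize(0.9, "female"))  # confident and correct   -> group2
print(categorize(0.1, "female"))  # confident but incorrect -> group1
print(categorize(0.5, "male"))    # low-confidence          -> group3
```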

SLIDE 20

8. Gender Identification by Human (2)

■ Experimental result:

[Chart: human accuracy for group1, group2, and group3 (y-axis 0.4-0.8)]

・The trend is the same for humans and Image NN

  • group 1: humans can identify the user's gender with 45% accuracy
  • group 2: the accuracy is 10% better than Image NN's
  • group 3: the accuracy drops 25% compared with Image NN's

Average accuracy: 60%

SLIDE 21

9. Conclusion & Future Works

■ Conclusion:

・Proposed Text Image Fusion Neural Network (TIFNN) for gender identification

  • Components:
  • Text component
  • Image component
  • Fusion component

・Improvement compared with individual models

  • In-house experiment: +2.7-8.6pt for each language
  • Submission run: +1.3-6.1pt → Ranked 1st among all participants

■ Future Works:

・Analyzing how the proposed model interacts with texts and images

  • Understanding this interaction would make it possible to improve TIFNN
SLIDE 22

Thank you !!