Text Mining on Mailing Lists: Sentiment Analysis


SLIDE 1

Chair of Network Architectures and Services Department of Informatics Technical University of Munich

Text Mining on Mailing Lists: Sentiment Analysis

Gordon Heiczman, B. Sc.

October 13, 2017

SLIDE 2

Introduction

What is Sentiment Analysis?

G. Heiczman — Sentiment Analysis

SLIDE 5

Introduction

Problems of today:

Too much information
Too little time

SLIDE 6

Introduction

Agenda

Text Mining summary
Example of practical application
Presentation of results
Conclusion and Lessons Learned

SLIDE 7

Text Mining

Feature Selection

Main purpose: extract valuable information and get rid of redundant features

'Bag of Words' approach

Most common selection steps:

Removal of stop words (the, is, at ...)
Removal of plurals (dogs -> dog)
Word / n-gram frequency
Part of Speech (POS) tagging (adjectives)
Opinion words (like, hate, love ...)
Detection of negation (not good -> bad)
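
Several of the selection steps above can be sketched with the Python standard library alone. This is a minimal illustration, not the pipeline used in the talk: the stop-word list is a tiny made-up sample, `strip_plural` is a deliberately naive stand-in for real stemming, and all function names are hypothetical.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption); real lists are much longer.
STOP_WORDS = {"the", "is", "at", "a", "an", "and", "of", "to", "are"}

def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z']+", text.lower())

def strip_plural(word):
    """Very naive plural removal: drop a trailing 's' (dogs -> dog)."""
    if len(word) > 3 and word.endswith("s") and not word.endswith("ss"):
        return word[:-1]
    return word

def extract_features(text, n=2):
    """Return word and n-gram frequency counters for the text."""
    tokens = [strip_plural(t) for t in tokenize(text) if t not in STOP_WORDS]
    ngrams = zip(*(tokens[i:] for i in range(n)))
    return Counter(tokens), Counter(" ".join(g) for g in ngrams)

words, bigrams = extract_features("The dogs are not good, the dogs bark at cats")
print(words["dog"])         # frequency of the singularised word
print(bigrams["not good"])  # bigrams keep negations next to their target
```

Note that stop words are removed before the n-grams are built, so words that were separated only by stop words become neighbours. Whether that is desirable depends on the task.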

SLIDE 8

Text Mining

Sentiment Classification

Three main categories:

Machine Learning
Lexicon-based
Hybrid
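
To make the lexicon-based category concrete, here is a minimal sketch. The lexicon, its scores, and the function name are invented for illustration. Real lexicons contain thousands of scored entries, and real systems handle negation far more carefully than the single-word lookback used here.

```python
import re

# Hypothetical miniature lexicon; the scores are made up for this example.
LEXICON = {"good": 0.7, "love": 0.8, "like": 0.5, "bad": -0.7, "hate": -0.8}
NEGATIONS = {"not", "no", "never"}

def lexicon_polarity(text):
    """Average the lexicon scores of opinion words, flipping negated ones."""
    tokens = re.findall(r"[a-z']+", text.lower())
    scores = []
    for i, token in enumerate(tokens):
        if token in LEXICON:
            score = LEXICON[token]
            # Flip the sign when the previous token is a negation word.
            if i > 0 and tokens[i - 1] in NEGATIONS:
                score = -score
            scores.append(score)
    # Text without any opinion word is treated as neutral.
    return sum(scores) / len(scores) if scores else 0.0

print(lexicon_polarity("I love this draft"))
print(lexicon_polarity("This is not good"))
```

A machine-learning approach would instead learn such weights from labelled examples, and a hybrid approach combines both.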

SLIDE 9

Text Mining

Pitfalls

Named Entity Recognition, i.e. "What is the topic?"
Anaphora resolution: resolving reference words, i.e. "What is 'it' referring to?"
Sarcasm
Abbreviations, poor grammar / punctuation / spelling

SLIDE 10

Practical Application

Dataset
Language
Email retrieval
Content retrieval
Sentiment value retrieval
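
The email and content retrieval steps might look roughly like the sketch below. It assumes the archives are stored in mbox format, which the slides do not state (they only show a `.mail` filename and a numeric key), and it uses Python's standard `mailbox` module. The function names are hypothetical.

```python
import mailbox

def plain_text_body(message):
    """Return the first text/plain part of a message as a string."""
    for part in message.walk():
        if part.get_content_type() == "text/plain":
            payload = part.get_payload(decode=True)
            charset = part.get_content_charset() or "utf-8"
            return payload.decode(charset, errors="replace")
    return ""

def read_archive(path):
    """Yield (key, subject, body) for every message in an mbox file."""
    box = mailbox.mbox(path)
    try:
        for key, message in box.iteritems():
            yield key, message.get("Subject", ""), plain_text_body(message)
    finally:
        box.close()
```

Each yielded body can then be passed to whichever sentiment tool is chosen, and the key makes it possible to trace a score back to the individual email.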

SLIDE 11

Practical Application

Dataset

Collection of emails from the IETF (Internet Engineering Task Force). The task of the IETF is to set standards.


SLIDE 12

Practical Application

Language

C# or Python? There are not enough comprehensive, completely free tools.

Notable C# tools:

VaderSharp (free but primitive)
Aylien (paid)
Watson D.C. (paid)
Vivekn (free but no documentation)

Python tool: TextBlob


SLIDE 13

Practical Application

Multiple values obtained through sentiment analysis (SA):

Polarity (-1.0 to 1.0)
Subjectivity (0.0 to 1.0)
Most used word
Sentence Count
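
The last two values need no sentiment library at all. The sketch below shows one naive way to compute them with the standard library. It is not the implementation used in the talk, and both function names are hypothetical.

```python
import re
from collections import Counter

def sentence_count(text):
    """Count sentences naively by splitting on '.', '!' and '?'."""
    parts = re.split(r"[.!?]+", text)
    return sum(1 for part in parts if part.strip())

def most_used_word(text):
    """Return the most frequent word (case-insensitive), or None if empty."""
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(words).most_common(1)[0][0] if words else None

sample = "I like this draft. The draft is clear, and the draft is short! Any objections?"
print(sentence_count(sample))
print(most_used_word(sample))
```

Splitting on punctuation over-counts sentences that contain abbreviations such as "e.g.", and without stop-word removal the most used word in real emails is usually a function word like "the".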

SLIDE 14

Practical Application

TextBlob example

    from textblob import TextBlob

    blob = TextBlob("I think this presentation is really, really good!")
    print(blob.sentiment)              # Gives both polarity and subjectivity, around 1.0
    print(blob.words.count('really'))  # Gives 2
    print(blob.noun_phrases)           # Gives noun phrases, in this case "presentation"


SLIDE 15

Practical Application

Figure 1: Example of email with polarity 1.0

Filename: /home/.../geopriv/2007-12.mail
Key: 251

SLIDE 16

Practical Application

Program flow


SLIDE 18

Statistics

[Bar chart: one bar per group, axis ticks from 5 to 80; the group labels are not recoverable.]

Figure 2: Top 10 groups who use the most sentences

Even distribution
Indication of in-depth discussion or off-topic rambling?


SLIDE 19

Statistics

[Bar chart: one bar per group, axis ticks from 0.00 to 0.60. Bar values: 0.5, 0.375854, 0.307836, 0.303693, 0.29163, 0.276323, 0.253274, 0.251799, 0.25, 0.25. The group labels are not recoverable.]

Figure 3: Top 10 most positive groups

Logarithmic distribution
Notable group: "iaoc-scribes"
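
Rankings like the ones in these figures can be produced by aggregating per-email scores by group. The slides do not say how the group scores were aggregated, so the sketch below assumes a simple mean. The function name and the sample records are made up for illustration.

```python
from collections import defaultdict
from statistics import mean

def top_groups(records, n=10, reverse=True):
    """Rank groups by mean polarity; records are (group, polarity) pairs."""
    by_group = defaultdict(list)
    for group, polarity in records:
        by_group[group].append(polarity)
    averages = {group: mean(values) for group, values in by_group.items()}
    return sorted(averages.items(), key=lambda item: item[1], reverse=reverse)[:n]

# Made-up scores for illustration only; not taken from the IETF data.
records = [("group-a", 0.5), ("group-a", 0.25), ("group-b", -0.5), ("group-c", 0.0)]
print(top_groups(records, n=2))                 # most positive groups first
print(top_groups(records, n=1, reverse=False))  # most negative groups first
```

A mean is sensitive to group size: a group with a single very positive email can outrank a large, consistently positive group, so a minimum message count per group may be worth adding.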


SLIDE 20

Statistics

[Bar chart: one bar per group, axis ticks from 0.00 to 0.50; the group labels are not recoverable.]

Figure 4: Top 10 most negative groups

Stronger logarithmic distribution
Notable group: "ietf-sailors"


SLIDE 21

Statistics

[Bar chart: one bar per group, axis ticks from 0.0 to 1.2; the group labels are not recoverable.]

Figure 5: Top 10 most subjective groups

Surprising top scores
Discussion groups


SLIDE 22

Statistics

Of the 7 entries with the most negative polarity (-1.0), 6 belong to the group 'eos'.
All of them are in Spanish (?)
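
One way to catch such cases before scoring is a cheap language check. The sketch below is a rough heuristic, not something shown in the talk: it guesses that a text is English when enough of its words are common English stop words. The stop-word list, the threshold, and the function name are all assumptions.

```python
import re

# Small illustrative English stop-word list; an assumption, not from the slides.
ENGLISH_STOP_WORDS = {"the", "is", "are", "and", "of", "to", "in", "that", "this", "it", "for"}

def looks_english(text, threshold=0.15):
    """Guess whether text is English from its share of English stop words."""
    # [^\W\d_]+ matches runs of letters, including accented ones.
    words = re.findall(r"[^\W\d_]+", text.lower())
    if not words:
        return False
    hits = sum(1 for word in words if word in ENGLISH_STOP_WORDS)
    return hits / len(words) >= threshold

print(looks_english("This is the latest version of the draft."))
print(looks_english("Esta es la última versión del borrador."))
```

Emails that fail the check could be skipped or routed to a tool that supports their language, instead of being scored by an English-only model.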


SLIDE 23

Conclusion

Useful, but not universally

Lessons learned:

Filter the dataset intelligently
Don’t try to solve everything with one library