Text Mining on Mailing Lists: Sentiment Analysis
Gordon Heiczman, B. Sc.
October 13, 2017
Chair of Network Architectures and Services, Department of Informatics, Technical University of Munich
Introduction What is Sentiment Analysis? G. Heiczman — Sentiment Analysis 2
Introduction
Problems of today:
• Too much information
• Too little time
Introduction
Agenda
• Text Mining summary
• Example of practical application
• Presentation of results
• Conclusion and Lessons Learned
Text Mining
Feature Selection
Main purpose: extract valuable information and get rid of redundant features
'Bag of Words' approach
Most common selection steps:
• Removal of stop words (the, is, at ...)
• Removal of plurals (dogs -> dog)
• Word / n-gram frequency
• Part of Speech (POS) tagging (adjectives)
• Opinion words (like, hate, love ...)
• Detection of negation (not good -> bad)
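A few of the selection steps above can be sketched with the Python standard library alone. This is a minimal illustration, not the method used in the talk: the stop-word list, the one-character plural rule, and the `NOT_` prefix for negated words are all assumptions made for this example.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption); real lists are much longer.
STOP_WORDS = {"the", "is", "at", "a", "an", "and", "of", "to", "in", "are"}
NEGATIONS = {"not", "no", "never"}


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z']+", text.lower())


def strip_plural(word):
    """Naive plural removal: drop one trailing 's' (dogs -> dog)."""
    if len(word) > 3 and word.endswith("s") and not word.endswith("ss"):
        return word[:-1]
    return word


def select_features(text):
    """Return a bag of words (token -> count) after the selection steps."""
    features = []
    negate = False
    for token in tokenize(text):
        if token in NEGATIONS:
            negate = True  # mark the next kept word as negated
            continue
        if token in STOP_WORDS:
            continue
        token = strip_plural(token)
        features.append("NOT_" + token if negate else token)
        negate = False
    return Counter(features)


bag = select_features("The dogs are not good at the park, the dogs bark.")
print(bag.most_common(3))
```

The plural rule is deliberately crude (it would also shorten words such as "this"); a real pipeline would more likely use a stemmer or lemmatizer.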
Text Mining
Sentiment Classification
Three main categories:
• Machine Learning
• Lexicon-based
• Hybrid
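To make the lexicon-based category concrete, here is a minimal sketch. The lexicon entries and their polarity values are hypothetical, and the averaging and negation rules are simplifying assumptions rather than a description of any particular tool.

```python
import re

# Hypothetical mini-lexicon: word -> polarity in [-1.0, 1.0].
LEXICON = {"good": 0.7, "love": 0.9, "like": 0.5, "bad": -0.7, "hate": -0.9}
NEGATIONS = {"not", "never", "no"}


def lexicon_polarity(text):
    """Average the polarity of known opinion words, flipping after a negation."""
    scores = []
    negate = False
    for token in re.findall(r"[a-z']+", text.lower()):
        if token in NEGATIONS:
            negate = True
            continue
        if token in LEXICON:
            value = LEXICON[token]
            scores.append(-value if negate else value)
        negate = False  # a negation only affects the word right after it
    return sum(scores) / len(scores) if scores else 0.0


print(lexicon_polarity("I love this draft"))
print(lexicon_polarity("This is not good"))
```

A machine-learning classifier would instead learn such weights from labelled examples, and a hybrid approach combines both.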
Text Mining
Pitfalls
• Named Entity Recognition, i.e. "What is the topic?"
• Anaphora Resolution: resolving reference words, e.g. "What is 'it' referring to?"
• Sarcasm
• Abbreviations, poor grammar / punctuation / spelling
Practical Application
• Dataset
• Language
• Email retrieval
• Content retrieval
• Sentiment value retrieval
Practical Application
Dataset
Collection of emails from the IETF (Internet Engineering Task Force) mailing lists.
The IETF's task is to set Internet standards.
Practical Application
Language
C# or Python?
Too few comprehensive, completely free tools
Notable C# tools:
• VaderSharp (free but primitive)
• Aylien (paid)
• Watson D.C. (paid)
• Vivekn (free but no documentation)
Python tool: TextBlob
Practical Application
Multiple values obtained through sentiment analysis (SA):
• Polarity (-1.0 to 1.0)
• Subjectivity (0.0 to 1.0)
• Most used word
• Sentence count
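The two non-sentiment values in the list above can be approximated with the standard library. How the original program computed them is not stated, so the splitting rules below are assumptions; the sketch will miscount sentences that contain abbreviations such as "e.g.".

```python
import re
from collections import Counter


def sentence_count(text):
    """Count sentences naively by splitting on '.', '!' and '?'."""
    parts = re.split(r"[.!?]+", text)
    return sum(1 for part in parts if part.strip())


def most_used_word(text):
    """Return the most frequent lowercase word, or None for empty text."""
    words = re.findall(r"[a-z']+", text.lower())
    if not words:
        return None
    return Counter(words).most_common(1)[0][0]


sample = "The draft is ready. Is the draft final? The chairs say yes!"
print(sentence_count(sample), most_used_word(sample))
```

In practice stop words would usually be removed before picking the most used word; otherwise words like "the" dominate.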
Practical Application
TextBlob example
    from textblob import TextBlob

    blob = TextBlob("I think this presentation is really, really good!")
    print(blob.sentiment)              # Gives both polarity and subjectivity around 1.0
    print(blob.words.count('really'))  # Gives 2
    print(blob.noun_phrases)           # Gives noun phrases, in this case presentation
Practical Application
Figure 1: Example of email with polarity 1.0
• Filename: /home/.../geopriv/2007-12.mail
• Key: 251
Practical Application
Program flow
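The three stages named in the outline (email retrieval, content retrieval, sentiment value retrieval) might be wired together roughly as follows. This is a sketch under assumptions, not the program from the talk: the function names, the rule of dropping lines that start with ">", and the stand-in scorer are all hypothetical. In a real run the scorer would be something like `lambda text: TextBlob(text).sentiment`.

```python
from email import message_from_string
from email.message import Message


def extract_body(message: Message) -> str:
    """Return the first text/plain part of an email, without quoted lines."""
    for part in message.walk():
        if part.get_content_type() == "text/plain":
            payload = part.get_payload(decode=True)
            charset = part.get_content_charset() or "utf-8"
            text = payload.decode(charset, errors="replace")
            # Drop lines quoted from earlier mails so they are not scored twice.
            kept = [ln for ln in text.splitlines() if not ln.lstrip().startswith(">")]
            return "\n".join(kept).strip()
    return ""


def analyse(raw_email: str, score) -> dict:
    """Run email retrieval -> content retrieval -> scoring for one raw email."""
    message = message_from_string(raw_email)
    body = extract_body(message)
    return {"subject": message["Subject"], "body": body, "sentiment": score(body)}


RAW = "From: a@example.org\nSubject: Draft review\n\n> old text\nThis draft is good.\n"
# Stand-in scorer (word count) so the sketch runs without third-party packages.
result = analyse(RAW, score=lambda text: len(text.split()))
print(result)
```

Passing the scorer in as a parameter keeps the retrieval code independent of any one sentiment library.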
Statistics
Figure 2: Top 10 groups that use the most sentences (bar chart, y-axis from 0 to 80)
• Even distribution
• Indication of in-depth discussion or off-topic rambling?
Statistics
Figure 3: Top 10 most positive groups (bar chart, y-axis from 0.00 to 0.60)
Bar values: 0.5, 0.375854, 0.307836, 0.303693, 0.29163, 0.276323, 0.253274, 0.251799, 0.25, 0.25
• Logarithmic distribution
• Notable group: "iaoc-scribes"
Statistics
Figure 4: Top 10 most negative groups (bar chart, y-axis from 0.00 to -0.50)
• Stronger logarithmic distribution
• Notable group: "ietf-sailors"
Statistics
Figure 5: Top 10 most subjective groups (bar chart, y-axis from 0.0 to 1.2)
• Surprising top scores
• Discussion groups
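Rankings like the ones in Figures 3 to 5 can be produced by aggregating per-email scores by mailing-list group. The sketch below assumes the per-group value is the arithmetic mean, which the slides do not confirm. The group names and scores are made up for illustration and are not taken from the IETF data.

```python
from collections import defaultdict
from statistics import mean


def top_groups(records, n=10, reverse=True):
    """Rank groups by mean score; records are (group, score) pairs."""
    by_group = defaultdict(list)
    for group, score in records:
        by_group[group].append(score)
    averages = {group: mean(values) for group, values in by_group.items()}
    return sorted(averages.items(), key=lambda item: item[1], reverse=reverse)[:n]


# Made-up records for illustration only.
records = [
    ("group-a", 0.5), ("group-a", 0.25),
    ("group-b", -0.5),
    ("group-c", 0.0), ("group-c", 0.5),
]
print(top_groups(records, n=2))                 # highest mean first
print(top_groups(records, n=1, reverse=False))  # lowest mean first
```

Groups with very few emails can dominate such a ranking, so a minimum message count per group may be worth adding.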
Statistics
Of the 7 most negative entries (polarity -1.0), 6 belong to the group 'eos'.
All of them are in Spanish (?)
Conclusion
Useful, but not universally
Lessons learned:
• Filter the dataset intelligently
• Don't try to solve everything with one library
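One possible way to act on the first lesson is to skip emails that do not look like English before scoring them, since an English-oriented tool can produce misleading values on other languages. The heuristic below is an assumption made for illustration, not a real language detector; the hint words and the threshold are arbitrary.

```python
import re

# Very common English function words (an arbitrary, illustrative selection).
ENGLISH_HINTS = {"the", "and", "is", "of", "to", "in", "that", "for", "with", "this"}


def looks_english(text, threshold=0.15):
    """Heuristic: True if enough tokens are common English function words."""
    tokens = re.findall(r"[^\W\d_]+", text.lower())
    if not tokens:
        return False
    hits = sum(1 for token in tokens if token in ENGLISH_HINTS)
    return hits / len(tokens) >= threshold


emails = [
    "This is the draft of the new standard for the group",
    "Este es el borrador del nuevo estándar para el grupo",
]
english_only = [text for text in emails if looks_english(text)]
print(english_only)
```

A dedicated language-identification library would be more reliable; this sketch only shows where such a filter sits in the pipeline.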