SLIDE 1

Chair of Network Architectures and Services Department of Informatics Technical University of Munich

An empirical approach towards analysis of discussions on mailing lists

Simon Klimek

March 21, 2018

SLIDE 2

  • Motivation
  • Related Work
  • Approach
  • Evaluation
  • Future Work
  • Bibliography

SLIDE 3

Background - IETF

Figure 1: IETF Logo

  • Development of standards
  • 121 active working groups
  • RFCs (Requests for Comments)
SLIDE 4

Motivation

  • Discussions are held via mailing lists.
  • Can we analyze them automatically?
  • Can the resulting data help us better understand IETF processes?
SLIDE 8

Related Work

  • Conversational Speech
      • Telephone Conversations (human to human)
      • Online Chats
      • Plan recognition in dialogues
  • Formal Speech
      • Q&A Forum
SLIDE 9

Conversational Speech

  • Dialogue act labeling [6] on the Switchboard corpus [3]
  • Online Chat between multiple participants [7]
  • Plan recognition in dialogues [1]
SLIDE 10

Formal Speech

  • Question-answer forums [2]
SLIDE 12

Previous Work

Nikolai Schwellnus’ bachelor thesis "A Heat Map for IETF Standardization Activities" [5]

  • sentimentvalues: file (text), key (integer), polarity (real), subjectivity (real), mostusedword (text), sentencecount (integer)
  • list: name (text), announce (boolean), id (integer)
  • mails: messageid (text), file (text), key (integer), date (timestamp with time zone), date_local (timestamp), sender_addr (text), receiver (text), subject (text), inreply (text), spam (boolean), spamscore (numeric), sender_name (text), person (bigint), fast_person (bigint)
  • mail_threads: leaf (varchar(788)), depth (integer)
  • mail_on_list: messageid (text), list (text)
  • Relations: leaf:messageid, messageid:messageid, list:name

Figure 2: Database Schemata

SLIDE 14

Finding Discussions

Finding discussion threads?

  • [Doh] operational considerations (Eliot Lear)
  • Re: [Doh] operational considerations (Martin J. Dürst)
  • Re: [Doh] operational considerations (Jim Reid)
  • Re: [Doh] operational considerations (Eliot Lear)
  • Re: [Doh] operational considerations (Patrick McManus)
  • Re: [Doh] operational considerations (Jim Reid)
  • Re: [Doh] operational considerations (Eliot Lear)
  • Re: [Doh] operational considerations (Patrick McManus)
  • Re: [Doh] operational considerations (Hewitt, Rory)
  • Re: [Doh] operational considerations (Eliot Lear)
  • Re: [Doh] operational considerations (Patrick McManus)

Figure 3: Thread Structure

SLIDE 15

Finding Discussions

Finding discussion threads?

[Two plots: occurrences (logarithmic axis, 10^0 to 10^6) against the number of mails in one thread, and occurrences against the number of replies]

SLIDE 16

Finding Discussions

Finding discussion threads?

  • In-Reply-To
  • Thread view of MHonArc (https://www.mhonarc.org)

SLIDE 17

    WITH RECURSIVE replies (messageid, spam, sender_addr, receiver, depth, inreply) AS (
        SELECT messageid, spam, sender_addr, receiver, 1 AS depth, inreply
        FROM mails
        WHERE spam IS FALSE AND inreply IS NULL
      UNION ALL
        SELECT m.messageid, m.spam, m.sender_addr, m.receiver, tm.depth+1 AS depth, tm.inreply
        FROM replies tm, mails m
        WHERE m.inreply = tm.messageid
    )
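The recursive query above starts from non-spam mails without an In-Reply-To value and repeatedly attaches the mails that reply to them, counting the depth along the way. The following is a minimal sketch of that idea, not the author's implementation: it runs a reduced version of the query on an in-memory SQLite database. The reduced `mails` table, the final `SELECT`, the function name `thread_depths` and the sample rows are all assumptions made for illustration.

```python
import sqlite3

def thread_depths(rows):
    """Return {messageid: depth} for non-spam thread starts and their replies.

    `rows` holds (messageid, inreply, spam) tuples; this reduced table
    layout is an assumption, the real `mails` table has more columns.
    """
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE mails (messageid TEXT, inreply TEXT, spam INTEGER)")
    con.executemany("INSERT INTO mails VALUES (?, ?, ?)", rows)
    query = """
        WITH RECURSIVE replies (messageid, depth) AS (
            -- anchor: non-spam mails that start a thread
            SELECT messageid, 1 FROM mails
            WHERE spam = 0 AND inreply IS NULL
            UNION ALL
            -- step: mails replying to a mail that was already collected
            SELECT m.messageid, tm.depth + 1
            FROM replies tm JOIN mails m ON m.inreply = tm.messageid
        )
        SELECT messageid, depth FROM replies
    """
    result = dict(con.execute(query).fetchall())
    con.close()
    return result

# Hypothetical sample data: one thread with four mails and one spam thread.
SAMPLE = [
    ("a", None, 0),  # thread start
    ("b", "a", 0),   # reply to a
    ("c", "b", 0),   # reply to b
    ("d", "a", 0),   # second reply to a
    ("x", None, 1),  # spam thread start, never enters the result
    ("y", "x", 0),   # reply to spam, unreachable from any accepted start
]

if __name__ == "__main__":
    print(thread_depths(SAMPLE))
```

Because only thread starts are filtered for spam, replies to a spam thread start drop out as a side effect: they are never reached by the recursive step.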

SLIDE 20

Processing Mails

Extract text from mails.

  1. Multipurpose Internet Mail Extensions (MIME) [4]
      • text
          • text/plain
          • text/html
      • multipart
          • mixed
          • alternative
  2. Remove HTML tags, decode
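The two steps above can be sketched with Python's standard `email` package. This is an illustration under assumptions rather than the framework's actual code: the function name `extract_text`, the preference for text/plain over text/html, and the regex-based tag removal are choices made here, and the sample message is made up.

```python
import html
import re
from email import message_from_string
from email.message import Message

TAG_RE = re.compile(r"<[^>]+>")  # crude tag matcher, not a full HTML parser

def extract_text(msg: Message) -> str:
    """Return the text of a mail, preferring text/plain parts over text/html."""
    plain, rich = [], []
    for part in msg.walk():
        if part.is_multipart():
            continue  # containers (multipart/mixed, multipart/alternative) hold no text
        ctype = part.get_content_type()
        if ctype not in ("text/plain", "text/html"):
            continue  # skip attachments and other media types
        payload = part.get_payload(decode=True)  # undoes base64 / quoted-printable
        charset = part.get_content_charset() or "utf-8"
        text = payload.decode(charset, errors="replace")
        (plain if ctype == "text/plain" else rich).append(text)
    if plain:
        return "\n".join(plain).strip()
    # Fall back to HTML: drop the tags, then decode entities such as &amp;
    return html.unescape(TAG_RE.sub("", "\n".join(rich))).strip()

# Hypothetical multipart/alternative mail carrying both representations.
RAW = (
    "From: alice@example.org\n"
    "Subject: test\n"
    "MIME-Version: 1.0\n"
    'Content-Type: multipart/alternative; boundary="b1"\n'
    "\n"
    "--b1\n"
    "Content-Type: text/plain; charset=utf-8\n"
    "\n"
    "Hello list\n"
    "--b1\n"
    "Content-Type: text/html; charset=utf-8\n"
    "\n"
    "<p>Hello <b>list</b></p>\n"
    "--b1--\n"
)

HTML_ONLY = "Content-Type: text/html; charset=utf-8\n\n<p>a &amp; b</p>\n"

if __name__ == "__main__":
    print(extract_text(message_from_string(RAW)))
    print(extract_text(message_from_string(HTML_ONLY)))
```

A real archive will also contain mails with unknown or wrongly declared charsets, which this sketch does not handle.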
SLIDE 21

Quotation and Referencing
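The slides give no detail on how quotations are detected. A minimal sketch, assuming the common convention that quoted lines start with ">", splits a mail body into alternating quoted and newly written blocks. The function name `chunk_mail` and the block labels are hypothetical.

```python
from itertools import groupby

def chunk_mail(body: str):
    """Split a mail body into (kind, text) blocks; kind is 'quote' or 'new'."""
    def kind(line: str) -> str:
        # Assumption: a leading '>' marks a line quoted from an earlier mail.
        return "quote" if line.lstrip().startswith(">") else "new"

    blocks = []
    for block_kind, lines in groupby(body.splitlines(), key=kind):
        text = "\n".join(lines).strip()
        if text:  # drop blocks that consist of blank lines only
            blocks.append((block_kind, text))
    return blocks

# Hypothetical reply that answers two quoted passages inline.
BODY = (
    "> Is this deployable?\n"
    "> I doubt it.\n"
    "Yes, see section 3.\n"
    "\n"
    "> What about caching?\n"
    "Out of scope."
)

if __name__ == "__main__":
    for block in chunk_mail(BODY):
        print(block)
```

Each 'new' block can then be related to the 'quote' block directly before it, which is one possible way to link an answer to the passage it references.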

SLIDE 24

Processing of Mail-Blocks

  • Tokenization
  • Lexical Analysis
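As a rough illustration of the tokenization step, the sketch below splits a mail block into sentences and each sentence into word and punctuation tokens using regular expressions. The slides do not say which tokenizer was used, so the patterns and the function name `tokenize` are assumptions.

```python
import re

# Assumption: a sentence ends at '.', '!' or '?' followed by whitespace.
SENTENCE_END = re.compile(r"(?<=[.!?])\s+")
# A token is a word (optionally with inner hyphens or apostrophes) or one punctuation mark.
TOKEN = re.compile(r"\w+(?:[-']\w+)*|[^\w\s]")

def tokenize(block: str):
    """Return a list of sentences, each given as a list of tokens."""
    sentences = [s for s in SENTENCE_END.split(block.strip()) if s]
    return [TOKEN.findall(sentence) for sentence in sentences]

if __name__ == "__main__":
    print(tokenize("Is this in scope? I don't think so."))
```

Abbreviations such as "e.g." would be split as sentence ends by this pattern, so a trained sentence splitter is likely the better choice for real mail text.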
SLIDE 26

Further Analysis

  • sentence based
      • dialogue acts
  • mail-block based
      • subjectivity
      • polarity
SLIDE 27

Framework Overview

  • Finding mail threads
  • Reading a single mail thread
  • Pipeline for Preprocessing
      • Decoding
      • Mail-block chunking
      • Tokenization
      • Quotation/Referencing
      • Polarity/Subjectivity
  • Analyzer
SLIDE 28

Results - Influential People

[Bar chart; axis: # final says, ticks from 1,000 to 6,500. Sender addresses shown:]

notifications@github.com, bidulock@openss7.org, julian.reschke@gmx.de, brian.e.carpenter@gmail.com, jari.arkko@piuha.net, moore@cs.utk.edu, touch@isi.edu, christer.holmberg@ericsson.com, stephen.farrell@cs.tcd.ie, stpeter@stpeter.im, pekkas@netcore.fi, harald@alvestrand.no, trac@tools.ietf.org, dhc@dcrocker.net, paul.hoffman@vpnc.org, alexey.melnikov@isode.com, kent@bbn.com, j.schoenwaelder@jacobs-university.de, martin.thomson@gmail.com, mnot@mnot.net, magnus.westerlund@ericsson.com, henrik@levkowetz.com, fluffy@cisco.com, fred@cisco.com, bclaise@cisco.com, alexandru.petrescu@gmail.com, ted.lemon@nominum.com, john-ietf@jck.com, nico@cryptonector.com, dotis@mail-abuse.org

SLIDE 29

Results - Influential People

Who starts the longest discussions?

[Bar chart; axis: average thread length, ticks from 16 to 32. Sender addresses shown:]

xiangli@seguesoft.com, iane@sussex.ac.uk, jerome.grenier@bell.ca, yuri.ismailov@ericsson.com, npowell@harris.com, thomas@koch.ro, stefan.alfare@swisscom.com, tammy_leino@mentor.com, segred@ics.forth.gr, sroberts@uniserve.com, yhirano@google.com, jordan.melzer@telus.com, cnd@geek.net.au, dave.d.smith@alcatel-lucent.com, gclark@mti-systems.com, phessler@theapt.org, shinji.okumura@softfront.jp, bart.bogaert@nokia.com, mikebianc@aol.com, jrn@jrn.me.uk, kawashimam@vx.jp.nec.com, ankriste@cisco.com, owen@delong.com, thomas.haynes@sun.com, nataraju.sip@gmail.com, andrewmcgr@gmail.com, kazu@iij.ad.jp, rsk@gsp.org, michael@wyraz.de, victor@jvknet.com
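The metric on this slide, average thread length per thread starter, could be computed along the following lines. This is a sketch under assumptions: the input format (pairs of starter address and number of mails in the thread) and both function names are hypothetical, and the sample values are made up.

```python
from collections import defaultdict

def average_thread_length(threads):
    """Map each thread starter to the mean number of mails in their threads.

    `threads` is an iterable of (starter_address, number_of_mails) pairs.
    """
    totals = defaultdict(lambda: [0, 0])  # starter -> [sum of lengths, thread count]
    for starter, length in threads:
        totals[starter][0] += length
        totals[starter][1] += 1
    return {starter: total / count for starter, (total, count) in totals.items()}

def longest_discussion_starters(threads, top=3):
    """Return the `top` starters with the highest average thread length."""
    averages = average_thread_length(threads)
    return sorted(averages.items(), key=lambda item: item[1], reverse=True)[:top]

# Hypothetical input data.
THREADS = [
    ("a@example.org", 10),
    ("a@example.org", 30),
    ("b@example.org", 25),
    ("c@example.org", 2),
]

if __name__ == "__main__":
    print(longest_discussion_starters(THREADS, top=2))
```

An average over very few threads is easily dominated by a single long discussion, so a minimum number of started threads per sender may be worth adding as a filter.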

SLIDE 30

Results - Unanswered Questions

Do questions remain unanswered?

[Bar chart; axis: # unanswered questions at the end of a thread, ticks from 500 to 1,900. Sender addresses shown:]

bidulock@openss7.org, alexandru.petrescu@gmail.com, bclaise@cisco.com, tolga.asveren@ss8.com, christer.holmberg@ericsson.com, brian.e.carpenter@gmail.com, pkyzivat@cisco.com, julian.reschke@gmx.de, notifications@github.com, harald@alvestrand.no, andy@yumaworks.com, pekkas@netcore.fi, pthubert@cisco.com, kent@bbn.com, shares@ndzh.com, pkyzivat@alum.mit.edu, stephen.farrell@cs.tcd.ie, phil.hunt@oracle.com, fred.l.templin@boeing.com
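One simple way to approximate "unanswered questions at the end of a thread" is to check whether the last mail of a thread contains a newly written line that ends with a question mark. The sketch below does exactly that; it is a heuristic assumed here for illustration, not necessarily the criterion behind the chart, and the function names and sample threads are hypothetical.

```python
from collections import Counter

def has_question(body: str) -> bool:
    """True if a non-quoted line of the mail ends with a question mark."""
    for line in body.splitlines():
        if line.lstrip().startswith(">"):
            continue  # assumption: '>' marks text quoted from an earlier mail
        if line.rstrip().endswith("?"):
            return True
    return False

def unanswered_by_sender(threads):
    """Count, per sender, the threads whose final mail asks a question.

    `threads` is an iterable of threads; each thread is a chronological
    list of (sender_address, body) pairs.
    """
    counts = Counter()
    for thread in threads:
        if not thread:
            continue
        sender, body = thread[-1]  # nobody replied after this mail
        if has_question(body):
            counts[sender] += 1
    return counts

# Hypothetical threads: the first ends with a question, the second does not.
THREADS = [
    [
        ("a@example.org", "Proposal attached."),
        ("b@example.org", "> Proposal attached.\nWhy not reuse the existing headers?"),
    ],
    [
        ("a@example.org", "Any objections?"),
        ("c@example.org", "> Any objections?\nNone from me."),
    ],
]

if __name__ == "__main__":
    print(unanswered_by_sender(THREADS))
```

Questions that wrap across several lines or are answered off-list are misjudged by this check, which is one reason a dialogue act classifier would be a more robust basis.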

SLIDE 32

Future Work

  • Dialogue Act classification
  • Background information about members
  • In-depth analysis of discussions
SLIDE 33

Thank you for your attention!

Feel free to ask questions.

SLIDE 34

[1] S. Carberry. Plan recognition in natural language dialogue. ACL-MIT Press series in natural language processing. MIT Press, Cambridge, Mass. u.a., 1990.

[2] G. Cong, L. Wang, C.-Y. Lin, Y.-I. Song, and Y. Sun. Finding question-answer pairs from online forums. In Proceedings of the 31st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR ’08, pages 467–474, New York, NY, USA, 2008. ACM.

[3] J. J. Godfrey, E. C. Holliman, and J. McDaniel. Switchboard: Telephone speech corpus for research and development. In Acoustics, Speech, and Signal Processing, 1992. ICASSP-92., 1992 IEEE International Conference on, volume 1, pages 517–520. IEEE, 1992.

SLIDE 35

[4] P. W. Resnick. Internet message format. RFC 5322, RFC Editor, October 2008. http://www.rfc-editor.org/rfc/rfc5322.txt.

[5] N. Schwellnus. A heat map for IETF standardization activities, 2016.

[6] A. Stolcke, K. Ries, N. Coccaro, E. Shriberg, R. Bates, D. Jurafsky, P. Taylor, R. Martin, C. V. Ess-Dykema, and M. Meteer. Dialogue act modeling for automatic tagging and recognition of conversational speech. Computational Linguistics, 26(3):339–373, 2000.

[7] S. Trausan-Matu. Automatic support for the analysis of online collaborative learning chat conversations. In P. Tsang, S. K. S. Cheung, V. S. K. Lee, and R. Huang, editors, Hybrid Learning, pages 383–394, Berlin, Heidelberg, 2010. Springer Berlin Heidelberg.
