Hacking a Way Through the Twitter Language Jungle: Syntactic Annotation, Tagging, and Parsing of English Tweets



slide-1
SLIDE 1

Nathan Schneider • NLPIT, Rotterdam • June 23, 2015

http://feelgrafix.com/835928-jungle-wallpaper.html

Hacking a Way Through the Twitter Language Jungle: Syntactic Annotation, Tagging, and Parsing of English Tweets

slide-2
SLIDE 2

Edited Text


slide-3
SLIDE 3

Conversational Web Text


#jesuischarlie <3333 wut! u da man! *fist pump*

slide-4
SLIDE 4


Diagram labels: RICHNESS, ROBUSTNESS

Tasks shown: syntactic parsing, semantic parsing, NER, POS

slide-5
SLIDE 5


representation annotation automation

slide-6
SLIDE 6


representation annotation automation

slide-7
SLIDE 7

Outline


  • Twitter POS tagging
  • Twitter dependency parsing
  • What’s next?
slide-8
SLIDE 8

Twitter


  • multi-word abbreviations
  • non-standard spellings
  • hashtags

Also: at-mentions, URLs, emoticons, symbols, typos, etc.

slide-9
SLIDE 9

Twitter POS

  • Part-of-speech tagging for Twitter: annotation, features, and experiments. Kevin Gimpel, Nathan Schneider, Brendan O’Connor, Dipanjan Das, Daniel Mills, Jacob Eisenstein, Michael Heilman, Dani Yogatama, Jeffrey Flanigan, and Noah A. Smith. ACL-HLT 2011.
  • Improved part-of-speech tagging for online conversational text with word clusters. Olutobi Owoputi, Brendan O’Connor, Chris Dyer, Kevin Gimpel, Nathan Schneider, and Noah A. Smith. NAACL-HLT 2013.


slide-10
SLIDE 10

Twitter POS: Representation


Lexical & punctuation:
  • common noun
  • determiner
  • proper noun
  • preposition
  • pronoun
  • verb particle
  • verb
  • coordinating conjunction
  • adjective
  • numeral
  • adverb
  • interjection
  • punctuation
  • predeterminer / existential there

Complex:
  • nominal+possessive (his, books’)
  • proper noun+possessive (Mary’s book)
  • nominal+verbal (ur, ima)
  • proper noun+verbal (Mary’s happy)
  • existential+verbal (there’s)

slide-11
SLIDE 11

Twitter POS: Representation


Twitter-specific:
  • hashtag: #mcconnelling
  • username: @justinbieber
  • URL/email: cnn.com bob@bob.com
  • emoticon: :-) \o/
  • Twitter discourse marker: RT <—

Other:
  • ily mfw ™

slide-12
SLIDE 12

Twitter POS: Annotation


  • 17 annotators
slide-13
SLIDE 13

Twitter POS: Annotation

Bar chart (axis from 83.0 to 93.0):
  • Inter-annotator agreement: 92.2
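The slide reports only the agreement figure, not how it was computed. As a minimal sketch, assuming token-level agreement between two annotators who tagged the same tokens, the raw agreement rate and a chance-corrected companion (Cohen's kappa) could be computed as follows. The function name and the toy tag sequences are hypothetical.

```python
from collections import Counter


def agreement_and_kappa(tags_a, tags_b):
    """Return (observed agreement, Cohen's kappa) for two parallel tag sequences."""
    if len(tags_a) != len(tags_b) or not tags_a:
        raise ValueError("need two non-empty sequences of equal length")
    n = len(tags_a)
    # Fraction of tokens on which both annotators chose the same tag.
    observed = sum(a == b for a, b in zip(tags_a, tags_b)) / n
    # Agreement expected by chance, from each annotator's tag distribution.
    count_a, count_b = Counter(tags_a), Counter(tags_b)
    expected = sum(count_a[t] * count_b[t] for t in count_a) / (n * n)
    if expected == 1.0:
        return observed, 1.0  # both annotators used a single identical tag
    return observed, (observed - expected) / (1.0 - expected)


# Toy example (invented tags, not data from the talk).
OBSERVED, KAPPA = agreement_and_kappa(["N", "V", "N", "D"], ["N", "V", "A", "D"])
```

Raw agreement is easy to read but flatters tagsets dominated by a few frequent tags, which is why kappa is often reported alongside it.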

slide-14
SLIDE 14

Twitter POS: Automation

Bar chart (axis from 83.0 to 93.0):
  • Stanford tagger: 85.9
  • Our CRF, base features: 83.4
  • Our CRF, all features (incl. special regexes, distributional similarity, phonetic normalization, tag dictionary): 89.4
  • Inter-annotator agreement: 92.2

slide-15
SLIDE 15


Can we do better?

  • No explicit annotation guidelines document → some conventions unclear.
  • Tokenization difficult due to creative emoticons. :~P \o/
  • Accuracy still lags on rare/OOV words.
  • CRF is too slow to tag huge volumes of text.

slide-16
SLIDE 16


Annotation Conventions

  • Jesse & the Rippers; the California Chamber of Commerce
  • All and only nouns within proper names tagged as proper noun
  • (1) this wind is serious
    (2) i just orgasmed over this
    (3) You should know, that if you come any closer …
  • this/that: (1) determiner, (2) pronoun, (3) preposition/subordinator
  • Gimpel et al. annotations were inconsistent, so we corrected them

slide-17
SLIDE 17


Annotation Conventions

  • RT @anonuser : Tonight’s memorial for Lucas Ransom starts at 8:00 p.m. and is being held at the open space at the corner of Del Pla ...
  • Proper noun (truncated, but we can tell from context)
  • (1) I need to go home man .
    (2) * Bbm yawn face * Man that #napflow felt so refreshing .
  • (1): noun, (2): interjection
slide-18
SLIDE 18


Improved Tokenizer

  • Rule-based, with regular expressions for faces, etc.: :*O  o_-  <3333
  • Also better URL patterns: about.me
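A rough illustration of the rule-based idea, not the tokenizer from the talk: the patterns below are hypothetical and far smaller than a production rule set. They show how emoticon, heart, and bare-domain URL rules can be tried before generic word and punctuation rules, so that strings like :*O or about.me survive as single tokens.

```python
import re

# Illustrative patterns only (assumptions, not the actual tokenizer rules).
# Order matters: specific patterns come before the generic word/punctuation ones.
PATTERNS = [
    r"https?://\S+",                                                           # full URLs
    r"[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)*\.(?:com|org|net|me)(?![A-Za-z0-9])",   # bare domains
    r"<+/?3+",                                                                 # hearts: <3, <3333, </3
    r"\\o/",                                                                   # arms-raised figure
    r"[:;=][\-~*'^]?[)(\]\[DPpOo/\\|3]",                                       # western faces: eyes, optional nose, mouth
    r"[oO\-^>][_.]+[oO\-^<]",                                                  # eastern faces: o_- -_- ^_^
    r"[@#]\w+",                                                                # at-mentions and hashtags
    r"\w+(?:'\w+)?",                                                           # words, with an optional clitic
    r"\S",                                                                     # any other single symbol
]
TOKEN_RE = re.compile("|".join(f"(?:{p})" for p in PATTERNS))


def tokenize(text):
    """Split text into tokens, keeping emoticons, hearts, and URLs whole."""
    return TOKEN_RE.findall(text)


EXAMPLE_TOKENS = tokenize("wut! u da man :*O o_- <3333 about.me")
```

Putting the specific alternatives first is the main design choice: a regex alternation takes the first branch that matches at a position, so the generic symbol rule only fires when no emoticon or URL rule applies.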
slide-19
SLIDE 19


Word Clusters

  • Brown clusters help smooth over lexicon to better accommodate rare/OOV words
  • We trained 1000 clusters on 56M English tweets (847M tokens) spread over 4 years

slide-20
SLIDE 20


Word Clusters

     Binary path      Top words (by frequency)
A1   111010100010     lmao lmfao lmaoo lmaooo hahahahaha lool ctfu rofl loool lmfaoo lmfaooo lmaoooo lmbo lololol
A2   111010100011     haha hahaha hehe hahahaha hahah aha hehehe ahaha hah hahahah kk hahaa ahah
A3   111010100100     yes yep yup nope yess yesss yessss ofcourse yeap likewise yepp yesh yw yuup yus
A4   111010100101     yeah yea nah naw yeahh nooo yeh noo noooo yeaa ikr nvm yeahhh nahh nooooo
A5   11101011011100   smh jk #fail #random #fact smfh #smh #winning #realtalk smdh #dead #justsaying
B    011101011        u yu yuh yhu uu yuu yew y0u yuhh youh yhuu iget yoy yooh yuo yue juu dya youz yyou
C    11100101111001   w fo fa fr fro ov fer fir whit abou aft serie fore fah fuh w/her w/that fron isn agains
D    111101011000     facebook fb itunes myspace skype ebay tumblr bbm flickr aim msn netflix pandora
E1   0011001          tryna gon finna bouta trynna boutta gne fina gonn tryina fenna qone trynaa qon
E2   0011000          gonna gunna gona gna guna gnna ganna qonna gonnna gana qunna gonne goona
F    0110110111       soo sooo soooo sooooo soooooo sooooooo soooooooo sooooooooo soooooooooo
G1   11101011001010   ;) :p :-) xd ;-) ;d (; :3 ;p =p :-p =)) ;] xdd #gno xddd >:) ;-p >:d 8-) ;-d
G2   11101011001011   :) (: =) :)) :] :’) =] ^_^ :))) ^.^ [: ;)) ((: ^__^ (= ^-^ :))))
G3   1110101100111    :( :/ -_- -.- :-( :’( d: :| :s -__- =( =/ >.< -___- :-/ </3 :\ -____- ;( /: :(( >_< =[ :[ #fml
G4   111010110001     <3 xoxo <33 xo <333 #love s2 <URL-twitition.com> #neversaynever <3333

Group labels: laughter; hearts/love symbols
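Because Brown clustering is hierarchical, clusters whose binary paths share a long prefix sit close together in the tree. The sketch below copies a few paths from the table and computes their shared prefix; the dictionary and function names are made up for illustration.

```python
from os.path import commonprefix  # character-wise common prefix of a list of strings

# Binary paths copied from the cluster table on this slide.
CLUSTER_PATHS = {
    "A1": "111010100010", "A2": "111010100011",
    "A3": "111010100100", "A4": "111010100101",
    "E1": "0011001", "E2": "0011000",
    "G1": "11101011001010", "G2": "11101011001011",
    "G3": "1110101100111", "G4": "111010110001",
}


def shared_prefix(*cluster_ids):
    """Longest bit-string prefix shared by the given clusters' paths."""
    return commonprefix([CLUSTER_PATHS[c] for c in cluster_ids])


LAUGHTER = shared_prefix("A1", "A2")            # the two laughter clusters
FACES = shared_prefix("G1", "G2", "G3", "G4")   # emoticon and heart clusters
```

The four G clusters share the prefix 1110101100, which is the same emoticon prefix listed among the highest-weighted clusters later in the deck.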

slide-21
SLIDE 21


Word Clusters


Group labels: hearts/love symbols + faces; laughter

slide-22
SLIDE 22


Word Clusters


Group labels: hearts/love symbols + faces; laughter + interjections

slide-23
SLIDE 23


Word Clusters


Group labels: hearts/love symbols + faces; laughter + interjections

slide-24
SLIDE 24


Word Clusters

Feature set                                   OCT27TEST   DAILY547   NPSCHATTEST
All features                                  91.60       92.80      91.19
with clusters; without tagdicts, namelists    91.15       92.38      90.66
without clusters; with tagdicts, namelists    89.81       90.81      90.00
only clusters (and transitions)               89.50       90.54      89.55
without clusters, tagdicts, namelists         86.86       88.30      88.26
Gimpel et al. (2011) version 0.2              88.89       89.17      -

Reference values:
  • Inter-annotator agreement (Gimpel et al., 2011): 92.2
  • Model trained on all OCT27: 93.2

slide-25
SLIDE 25


Word Clusters

Highest Weighted Clusters

Cluster prefix   Tag   Types   Most common word in each cluster with prefix
100110*          &     103     or n & and
11101*           O     899     you yall u it mine everything nothing something anyone someone everyone nobody
01*              V     29267   do did kno know care mean hurts hurt say realize believe worry understand forget agree remember love miss hate think thought knew hope wish guess bet have
1101*            D     378     the da my your ur our their his
111110*          A     6510    young sexy hot slow dark low interesting easy important safe perfect special different random short quick bad crazy serious stupid weird lucky sad
1110101100*      E     2798    x <3 :d :p :) :o :/
11000*           L     428     i'm im you're we're he's there's its it's
11101010*        !     8160    lol lmao haha yes yea oh omg aww ah btw wow thanks sorry congrats welcome yay ha hey goodnight hi dear please huh wtf exactly idk bless whatever well ok
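Cluster prefixes like those in the table can be turned into tagger features by emitting a word's bit-string path truncated at several lengths. This is only a sketch: the slide does not state which prefix lengths were used, so the even lengths 2 to 16 below are an assumption, as are the function name and feature-string format. The single lexicon entry is taken from the word cluster table earlier in the deck.

```python
def cluster_prefix_features(word, word_to_path, prefix_lengths=(2, 4, 6, 8, 10, 12, 14, 16)):
    """Return one feature string per prefix length of the word's cluster path.

    prefix_lengths is an assumed setting, not one stated on the slides.
    Words missing from the cluster lexicon get no cluster features.
    """
    path = word_to_path.get(word.lower())
    if path is None:
        return []
    features = []
    for k in prefix_lengths:
        features.append(f"cluster_prefix={path[:k]}")
        if k >= len(path):
            break  # longer prefixes would only repeat the full path
    return features


# "lmao" heads cluster A1 (path 111010100010) in the word cluster table.
LEXICON = {"lmao": "111010100010"}
FEATURES = cluster_prefix_features("LMAO", LEXICON)
```

Short prefixes give coarse, well-attested classes and long prefixes give fine-grained ones, so a rare spelling can still share features with its frequent neighbors. The length-8 prefix here, 11101010, is the interjection row in the table above.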

slide-26
SLIDE 26


Speed

  • Tokenizer: 3500 tweets/s
  • MEMM instead of CRF is much faster
  • Greedy: 800 tweets/s (10k w/s), barely any loss in accuracy relative to Viterbi
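To make the greedy versus Viterbi contrast concrete, here is a minimal sketch of both decoders over a local scoring function of the kind an MEMM provides (a score for each tag given the previous tag and the current token). The scorer, its weights, and the three-tag set are invented for illustration and are not the tagger's model.

```python
def greedy_decode(tokens, tags, score):
    """Pick the best tag at each position, conditioning on the previous choice."""
    prev, output = "<s>", []
    for tok in tokens:
        prev = max(tags, key=lambda t: score(prev, tok, t))
        output.append(prev)
    return output


def viterbi_decode(tokens, tags, score):
    """Find the highest-scoring tag sequence by dynamic programming."""
    if not tokens:
        return []
    # best[t] = (total score, tag sequence) of the best path ending in tag t
    best = {t: (score("<s>", tokens[0], t), [t]) for t in tags}
    for tok in tokens[1:]:
        new_best = {}
        for t in tags:
            p = max(tags, key=lambda q: best[q][0] + score(q, tok, t))
            new_best[t] = (best[p][0] + score(p, tok, t), best[p][1] + [t])
        best = new_best
    return max(best.values(), key=lambda entry: entry[0])[1]


def toy_score(prev_tag, token, tag):
    """Hypothetical additive scores standing in for a trained model."""
    emit = {("i", "O"): 2.0, ("<3", "V"): 1.5, ("<3", "E"): 1.0, ("u", "O"): 2.0}
    trans = {("O", "V"): 0.5, ("V", "O"): 0.5}
    return emit.get((token, tag), 0.0) + trans.get((prev_tag, tag), 0.0)


TAGS = ["O", "V", "E"]
GREEDY = greedy_decode(["i", "<3", "u"], TAGS, toy_score)
VITERBI = viterbi_decode(["i", "<3", "u"], TAGS, toy_score)
```

Greedy decoding scores each token against one previous tag, while Viterbi considers every previous tag for every candidate, so the per-token work drops from quadratic to linear in the tagset size. On this toy input both decoders return the same sequence.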

slide-27
SLIDE 27

Outline


  • Twitter POS tagging
  • Twitter dependency parsing
  • What’s next?
slide-28
SLIDE 28

Twitter Syntax: Representation


OMG I <3 the Biebs & want to have his babies ! —> LA Times: Teen Pop Star Heartthrob is All the Rage on Social Media… #belieber

slide-29
SLIDE 29

Twitter Syntax: Representation


OMG I <3 the Biebs & want to have his babies ! —> LA Times : Teen Pop Star Heartthrob is All the Rage on Social Media … #belieber

slide-30
SLIDE 30

Twitter Syntax: Representation


OMG I <3 the Biebs & want to have his babies ! —> LA Times : Teen Pop Star Heartthrob is All the Rage on Social Media … #belieber

slide-31
SLIDE 31

Twitter Syntax: Representation


OMG I <3 the Biebs & want to have his babies LA Times Teen Pop Star Heartthrob is All the Rage on Social Media

slide-32
SLIDE 32

Twitter Syntax: Representation


OMG I <3 the Biebs & want to have his babies LA_Times Teen Pop Star Heartthrob is All_the_Rage on Social Media

slide-33
SLIDE 33

Twitter Syntax: Representation


OMG I <3 the Biebs & want to have his babies LA_Times Teen Pop Star Heartthrob is All_the_Rage on Social Media

  • Fragmentary Unlabeled Dependency Grammar (“FUDG”; Schneider et al. 2013)
  • allows utterance segmentation, token selection, MWEs, coordination, underspecification


slide-34
SLIDE 34

Twitter Syntax: Representation


OMG I <3 the Biebs & want to have his babies LA_Times Teen Pop Star Heartthrob is All_the_Rage on Social Media

  • FUDG can be rendered in ASCII (“GFL”):

    Teen > (Pop > Star) > Heartthrob
    Heartthrob > is** < [All the Rage]
    is < on < (Social > Media)
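The three GFL lines above can be read as a set of dependency edges. The sketch below handles only the constructs that appear in this example and is not the official GFL parser. It assumes that each arrow points at the head (so `A > B` and `B < A` both make A a dependent of B), that a parenthesized group is represented by its internal head, that square brackets join a multiword expression into one node, and that `**` marks the utterance root. All function names are hypothetical.

```python
import re
from collections import deque

# Tokens: a bracketed MWE, a single paren or arrow, or a plain word (possibly ending in **).
TOKEN_RE = re.compile(r"\[[^\]]*\]|[()<>]|[^\s()<>\[\]]+")


def _term(toks, edges, roots):
    """Parse one word, MWE, or parenthesized group; return its head node."""
    tok = toks.popleft()
    if tok == "(":
        head = _chain(toks, edges, roots)
        toks.popleft()  # discard the closing ")"
        return head
    if tok.startswith("["):
        tok = "_".join(tok[1:-1].split())  # [All the Rage] -> All_the_Rage
    if tok.endswith("**"):
        tok = tok[:-2]
        roots.add(tok)
    return tok


def _chain(toks, edges, roots):
    """Parse `term (arrow term)*`, record edges, return the chain's head."""
    nodes = [_term(toks, edges, roots)]
    dependents = set()
    while toks and toks[0] in ("<", ">"):
        arrow = toks.popleft()
        left, right = nodes[-1], _term(toks, edges, roots)
        dep, head = (left, right) if arrow == ">" else (right, left)
        edges.add((dep, head))
        dependents.add(dep)
        nodes.append(right)
    # The head of the chain is the node that is nobody's dependent within it.
    return next(n for n in nodes if n not in dependents)


def parse_gfl(lines):
    """Return (set of (dependent, head) edges, set of root nodes)."""
    edges, roots = set(), set()
    for line in lines:
        toks = deque(TOKEN_RE.findall(line))
        if toks:
            _chain(toks, edges, roots)
    return edges, roots


SLIDE_GFL = [
    "Teen > (Pop > Star) > Heartthrob",
    "Heartthrob > is** < [All the Rage]",
    "is < on < (Social > Media)",
]
EDGES, ROOTS = parse_gfl(SLIDE_GFL)
```

Under these assumptions the example yields eight edges with `is` as the root, and each line contributes only a fragment of the tree, which is what lets annotators leave parts of a tweet unspecified.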

slide-35
SLIDE 35

Twitter Syntax: Annotation


1 day of annotation, 26 participants

  • Custom web-based annotation tool (Mordowanec et al., ACL 2014 demo)

slide-36
SLIDE 36

Twitter Syntax: Automation

  • A supervised discriminative graph-based parser for tweets (Kong et al., EMNLP 2014)
      • (1) lexical selection: a sequence model
      • (2) parsing: a 2nd-order TurboParser model
      • produces FUDG parses (incl. coordination, multiple utterances, MWEs)
  • Experiments
      • train on PTB, test on tweets: 73% UAS
      • train on 1,473 English tweets (9k tokens): 80%
      • domain adaptation (train on tweets, with some features derived from PTB-trained parser): 81%
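UAS (unlabeled attachment score) is the share of tokens whose predicted head matches the gold head. The sketch below uses the standard textbook definition; how the reported experiments treat unselected tokens, multiword expressions, or multiple roots is not shown on the slide, so this should not be read as the exact evaluation behind the numbers above. The head-index encoding (0 for root) is an assumption.

```python
def unlabeled_attachment_score(gold_heads, pred_heads):
    """Fraction of tokens whose predicted head index equals the gold head index.

    Each list holds, for token i, the index of its head (0 is assumed to mean root).
    """
    if len(gold_heads) != len(pred_heads):
        raise ValueError("gold and predicted parses must cover the same tokens")
    if not gold_heads:
        return 0.0
    correct = sum(g == p for g, p in zip(gold_heads, pred_heads))
    return correct / len(gold_heads)


# Toy four-token parse (invented): the last token is attached to the wrong head.
UAS = unlabeled_attachment_score([2, 0, 2, 3], [2, 0, 2, 2])
```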


slide-37
SLIDE 37

Twitter POS & Syntax: Summary

  • Modified traditional representations to meet the needs of our domain & process
  • Rapid annotation by (mostly) CS grad students, informed by linguistics
  • Widely downloaded (>3,000), state-of-the-art POS tagger for Twitter; parser will be released in time for EMNLP
  • Syntactic representations & annotation tools inspired by Twitter are now being used for Wikipedia, low-resource African languages, and even Shakespeare!


slide-38
SLIDE 38

Twitter POS & Syntax: Links

  • http://www.ark.cs.cmu.edu/TweetNLP/
  • http://www.ark.cs.cmu.edu/FUDG/


slide-39
SLIDE 39


representation annotation automation

slide-40
SLIDE 40

What’s Next?

  • More languages/genres?
  • Richer/deeper representations?
  • Applications?


slide-41
SLIDE 41


Diagram labels: RICHNESS, ROBUSTNESS

Tasks shown: syntactic parsing, semantic parsing, NER, POS

slide-42
SLIDE 42


thx