Weak Semi-Markov CRFs for NP Chunking in Informal Text

SLIDE 1

Contributions Dataset Models Experiments Conclusion

Weak Semi-Markov CRFs for NP Chunking in Informal Text

Aldrian Obaja Muis and Wei Lu

Singapore University of Technology and Design

SLIDE 3

Paper Contributions

In this paper, we contributed:

  • 1. Noun phrase-annotated SMS corpus [1]
  • 2. Weak semi-Markov CRF

[1] Tao Chen and Min-Yen Kan (2013). "Creating a live, public short message service corpus: the NUS SMS corpus". In: Language Resources and Evaluation. Vol. 47. Springer Netherlands, pp. 299–335.

SLIDE 4

NP-annotated SMS Corpus

SLIDE 7

NP-annotated SMS Corpus

We used the Brat Rapid Annotation Tool (BRAT) [2] for annotation and recruited undergraduate students to annotate the noun phrases.

Examples: (annotated example figures not reproduced in this transcript)

[2] http://brat.nlplab.org/

SLIDE 11

Annotations Statistics

  • 64 annotators
  • 26,500 SMS messages
  • 76,490 noun phrases
  • 359,009 tokens
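To put the corpus counts in perspective, the short sketch below derives per-message averages from them. The averages are not stated on the slides; they are simple ratios of the reported totals, and the variable names are illustrative.

```python
# Corpus totals as reported on the slide.
messages = 26_500
noun_phrases = 76_490
tokens = 359_009

# Derived averages (not reported on the slides, computed here for illustration).
nps_per_message = noun_phrases / messages
tokens_per_message = tokens / messages

print(f"Noun phrases per message: {nps_per_message:.2f}")  # 2.89
print(f"Tokens per message: {tokens_per_message:.2f}")     # 13.55
```

On these numbers, a typical message is short (about 14 tokens) and contains roughly three noun phrases.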

SLIDE 12

Models

SLIDE 21

Models Comparison

n: number of words in the sentence, |Y|: number of labels, L: maximum segment length

  • Fig. 1: Linear CRF: O(n|Y|^2)
  • Fig. 2: Semi-CRF: O(nL|Y|^2)
  • Fig. 3: Weak Semi-CRF: O(n|Y|^2 + nL|Y|)

(Figures: lattice diagrams of the three models over the example words "said Dr Teh". Nodes are labelled B, I, O in the linear CRF and N, O in the semi-CRF and weak semi-CRF.)
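The complexity terms above can be read as counts of lattice edges. The sketch below enumerates edges for toy versions of the three lattices under one assumed reading of those terms: the semi-CRF picks the next segment's label and length jointly, while the weak semi-CRF picks the label first and the length separately. The function names and the simplified handling of sentence start and end are assumptions for illustration, not the authors' implementation.

```python
from itertools import product


def segment_spans(n, max_len):
    """All (start, length) pairs for segments that fit in a sentence of n words."""
    return [(start, length)
            for start in range(n)
            for length in range(1, max_len + 1)
            if start + length <= n]


def linear_crf_edges(n, labels):
    """One label-to-label transition per word position: about n * |Y|^2 edges."""
    return [(pos, prev, label)
            for pos in range(n)
            for prev, label in product(labels, repeat=2)]


def semi_crf_edges(n, labels, max_len):
    """Each edge jointly picks the next label and segment length: about n * L * |Y|^2."""
    return [(start, prev, label, length)
            for start, length in segment_spans(n, max_len)
            for prev, label in product(labels, repeat=2)]


def weak_semi_crf_edges(n, labels, max_len):
    """Label choice and length choice use separate edges: about n * |Y|^2 + n * L * |Y|."""
    label_edges = [(start, prev, label)
                   for start in range(n)
                   for prev, label in product(labels, repeat=2)]
    length_edges = [(start, label, length)
                    for start, length in segment_spans(n, max_len)
                    for label in labels]
    return label_edges + length_edges


labels = ["N", "O"]
print(len(linear_crf_edges(10, labels)))        # 40
print(len(semi_crf_edges(10, labels, 4)))       # 136
print(len(weak_semi_crf_edges(10, labels, 4)))  # 108
```

With only two labels the saving is modest; the gap between the semi-CRF and weak semi-CRF counts widens as the number of labels or the maximum segment length grows.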

SLIDE 22

Empirical Verification

SLIDE 23

F1-Score

(Bar chart: F1-score (%) of Linear CRF, Semi-CRF, and Weak Semi-CRF under three feature sets: basic features, +affixes, and all features. Bar values in extraction order: 71.19, 72.49, 72.68, 74.37, 74.69, 74.58, 74.39, 74.60, 74.31.)

SLIDE 24

Training Speed

(Line chart: average time per iteration in seconds, axis from 0.5 to 2, against the number of training instances (SMS), axis from 5,000 to 20,000, for Linear-CRF, Semi-CRF, and Weak Semi-CRF.)

SLIDE 25

Conclusion

SLIDE 27

Conclusion

  • We have created a new NP-annotated dataset on informal text.
  • We can split the decisions of selecting segment length and segment type to improve training time while maintaining similar accuracy.

SLIDE 28

Thank You

Code and data available at: http://statnlp.org/research/ie/

Aldrian Obaja Muis and Wei Lu

Singapore University of Technology and Design