Graph-based and Lexical-Syntactic Approaches for the Authorship Attribution Task

SLIDE 1

Introduction Proposed approaches Experimental settings and results Conclusion

Universidad Autónoma de Puebla

Graph-based and Lexical-Syntactic Approaches for the Authorship Attribution Task

Notebook for PAN at CLEF 2012

Esteban Castillo, Darnes Vilariño, David Pinto, Iván Olmos, Jesús A. González and Maya Carrillo

September 12, 2012

BUAP NLP September 12, 2012 Traditional Authorship Attribution 1 / 14

SLIDE 2


Index

  • Introduction
  • Proposed approaches
  • Experimental settings and results
  • Conclusion

SLIDE 3


Traditional Authorship Attribution

  • Authorship attribution assumes that texts contain unique and identifiable writeprints.
  • Finding the right features to characterize the signature, or particular writing style, of a given author is fundamental.

SLIDE 4


Lexical-syntactic approach: features

1 Phrase-level features

  • Word prefixes
    ⋄ e.g. ad → {advance, adjunct, adulterate}
  • Word suffixes
    ⋄ e.g. est → {finest, toughest, biggest}
  • Stopwords
    ⋄ e.g. {and, the, but, did}
  • Trigrams of PoS tags
    ⋄ e.g. she:PRP drove:VBD a:DT silver:NN pt:NN cruiser:NN → {(PRP, VBD, DT), (VBD, DT, NN), (DT, NN, NN), (NN, NN, NN)}

2 Character-level features

  • Vowel combination
    ⋄ e.g. influential → iueia → iuea
  • Vowel permutation
    ⋄ e.g. influential → iueia
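The feature types listed on this slide can be sketched in a few lines of Python. This is only an illustration that reproduces the slide's examples, not the authors' implementation: the function names, the prefix length of 2, the suffix length of 3, and the four-word stopword list are assumptions taken from the examples, and PoS tags are expected as input rather than produced by a tagger.

```python
from collections import Counter

VOWELS = "aeiou"
# Assumption: only the four stopwords shown on the slide.
STOPWORDS = {"and", "the", "but", "did"}

def prefix(word, n=2):
    """First n characters (n=2 is assumed, matching the 'ad' example)."""
    return word[:n].lower()

def suffix(word, n=3):
    """Last n characters (n=3 is assumed, matching the 'est' example)."""
    return word[-n:].lower()

def stopword_counts(tokens):
    """Frequency of each stopword among the tokens."""
    return Counter(t.lower() for t in tokens if t.lower() in STOPWORDS)

def pos_trigrams(tags):
    """All runs of three consecutive PoS tags."""
    return [tuple(tags[i:i + 3]) for i in range(len(tags) - 2)]

def vowel_permutation(word):
    """Vowels of the word in their original order, repeats kept."""
    return "".join(ch for ch in word.lower() if ch in VOWELS)

def vowel_combination(word):
    """Vowel permutation with repeated vowels dropped (first occurrence kept)."""
    return "".join(dict.fromkeys(vowel_permutation(word)))

tags = ["PRP", "VBD", "DT", "NN", "NN", "NN"]
print(pos_trigrams(tags))
print(vowel_permutation("influential"), vowel_combination("influential"))  # iueia iuea
```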

SLIDE 5


Lexical-syntactic approach: text representation

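  • Training stage:

$$(\underbrace{x_1, x_2, x_3, \ldots, x_s}_{\text{Feature } 1}, \cdots, \underbrace{y_1, y_2, y_3, \ldots, y_m}_{\text{Feature } n}, C)$$

  • Testing stage:

$$(\underbrace{x_1, x_2, x_3, \ldots, x_s}_{\text{Feature } 1}, \cdots, \underbrace{y_1, y_2, y_3, \ldots, y_m}_{\text{Feature } n})$$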
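A minimal sketch of how such a vector could be assembled follows. It assumes that each feature block holds frequency counts over a fixed vocabulary and that C is the class (author) label, which is present only in training rows; neither detail is stated on the slide, and the names `block`, `represent`, and the toy vocabularies are hypothetical.

```python
from collections import Counter

def block(items, vocabulary):
    """Counts of each vocabulary entry among the extracted items (assumed weighting)."""
    counts = Counter(items)
    return [counts[v] for v in vocabulary]

def represent(feature_items, vocabularies, label=None):
    """Concatenate one block per feature type; append the label for training rows."""
    vector = []
    for name in vocabularies:  # dicts keep insertion order, so the block order is fixed
        vector.extend(block(feature_items.get(name, []), vocabularies[name]))
    if label is not None:
        vector.append(label)
    return tuple(vector)

# Toy vocabularies and one toy document (illustrative values only).
vocabularies = {"prefix": ["ad", "un"], "stopword": ["and", "the", "but"]}
doc = {"prefix": ["ad", "ad", "un"], "stopword": ["the", "and", "the"]}

train_row = represent(doc, vocabularies, label="author_1")
test_row = represent(doc, vocabularies)
print(train_row)
print(test_row)
```

Using the same vocabularies for training and testing keeps both kinds of rows aligned column by column, which a classifier needs.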

SLIDE 6


Lexical-syntactic approach: Classification process

[Figure: classification process. Stages shown: feature extraction (training and test), classification algorithm, model, classification, result.]

SLIDE 7


Graph-based approach: features

  • This approach uses a graph-based representation of the text.
  • Each text paragraph is tagged with its PoS tags using the TreeTagger tool.
  • Each word is stemmed using the Porter stemmer.
  • In the graph, each vertex is a stemmed word and each edge is labeled with the corresponding PoS tag.
  • The word order of each paragraph is preserved.
  • Once each paragraph is represented as a graph, we apply a data mining algorithm called SUBDUE to find the most representative words of an author.

SLIDE 8


Graph-based approach: example

  • “second qualifier long road leading 1998 world cup”.
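The graph drawn for this sentence is not recoverable from the transcript, so the following is only a sketch of how such a graph might be built. It makes several assumptions that are not in the slides: consecutive words are linked by a directed edge, the edge label joins the PoS tags of its two endpoint words, the tags are hand-assigned for illustration (not TreeTagger output), and stemming is skipped. The function name `build_graph` is hypothetical.

```python
def build_graph(tagged_words):
    """Return (vertices, edges) for a sequence of (word, pos_tag) pairs.

    Vertices are the distinct words in order of first appearance.
    Edges link each word to the next one, so the word order is kept.
    Assumption: the edge label is the two endpoint tags joined by '_'.
    """
    vertices = []
    for word, _ in tagged_words:
        if word not in vertices:
            vertices.append(word)
    edges = [
        (w1, w2, f"{t1}_{t2}")
        for (w1, t1), (w2, t2) in zip(tagged_words, tagged_words[1:])
    ]
    return vertices, edges

# Hand-assigned tags for the example sentence (illustrative only).
sentence = [
    ("second", "JJ"), ("qualifier", "NN"), ("long", "JJ"), ("road", "NN"),
    ("leading", "VBG"), ("1998", "CD"), ("world", "NN"), ("cup", "NN"),
]
vertices, edges = build_graph(sentence)
for source, target, label in edges:
    print(f"{source} -[{label}]-> {target}")
```

A vertex and edge list of this kind is the sort of input a graph mining tool consumes; the exact file format SUBDUE expects is not covered here.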

SLIDE 9


Graph-based approach: text representation

  • Training stage:

$$D = (\underbrace{x_1, x_2, x_3, \ldots, x_n}_{\text{Words obtained from SUBDUE}}, C)$$

  • Testing stage:

$$D = (\underbrace{x_1, x_2, x_3, \ldots, x_n}_{\text{Words obtained from SUBDUE}})$$

SLIDE 10


Graph-based approach: Classification process

[Figure: classification process. Stages shown: training, classification algorithm, model, test, classification, result.]

SLIDE 11


Experimental settings

  • For SUBDUE, we extract the 30 most representative words.
  • For problems A, B, C, D, I and J, we used WEKA’s implementation of SVMs.
    ⋄ Kernel: polynomial mapping
  • For problems E and F, we used WEKA’s implementation of the K-means clustering method.
    ⋄ K = 2, 3 or 4 authors
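To make the two settings concrete, here is a small pure-Python stand-in. It is not WEKA's implementation: the kernel degree and the constant term are assumed values (the slide gives neither), and the k-means routine uses a simplified deterministic initialization from the first K points.

```python
def poly_kernel(x, y, degree=2, coef0=1.0):
    """Polynomial kernel (x . y + coef0) ** degree; degree and coef0 are assumed."""
    return (sum(a * b for a, b in zip(x, y)) + coef0) ** degree

def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans(points, k, iterations=20):
    """Plain Lloyd's algorithm; returns the cluster index of each point."""
    centroids = [list(p) for p in points[:k]]  # simplification: first k points
    assignment = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        assignment = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignment

# Toy feature vectors for four paragraphs, grouped into K = 2 authors.
paragraphs = [(0, 0), (0, 1), (10, 10), (10, 11)]
print(kmeans(paragraphs, k=2))
print(poly_kernel((1, 2), (3, 4)))
```

In the clustering problems, K plays the role of the number of authors, so the same routine would be run with K set to 2, 3 or 4 as the slide describes.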

SLIDE 12


Results

Results obtained in the traditional sub-task (correct answers / accuracy %)

Task                        A         B     C       D         I          J
Graph-based approach        5/83.333  6/60  5/62.5  4/23.529  8/57.142   13/81.25
Lexical-syntactic approach  4/66.666  3/30  2/25    6/35.294  10/71.428  7/43.75

Results obtained in the clustering sub-task (correct answers / accuracy %)

Task                        E          F
Graph-based approach        68/75.555  43/53.75
Lexical-syntactic approach  61/67.777  51/63.75
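Each cell pairs a count of correct answers with a percentage, so the number of test items per problem can be recovered by dividing one by the other. The short check below does that arithmetic; the totals it prints are inferred from the table and are not stated anywhere in the slides.

```python
# (correct, percent) pairs copied from the two tables above.
graph_based = {
    "A": (5, 83.333), "B": (6, 60), "C": (5, 62.5), "D": (4, 23.529),
    "I": (8, 57.142), "J": (13, 81.25), "E": (68, 75.555), "F": (43, 53.75),
}
lexical_syntactic = {
    "A": (4, 66.666), "B": (3, 30), "C": (2, 25), "D": (6, 35.294),
    "I": (10, 71.428), "J": (7, 43.75), "E": (61, 67.777), "F": (51, 63.75),
}

def inferred_total(correct, percent):
    """Number of test items implied by a correct count and its percentage."""
    return round(correct * 100 / percent)

totals_graph = {t: inferred_total(*cell) for t, cell in graph_based.items()}
totals_lexical = {t: inferred_total(*cell) for t, cell in lexical_syntactic.items()}

# Both rows should imply the same number of items for every problem.
print(totals_graph)
print(totals_graph == totals_lexical)

traditional = ["A", "B", "C", "D", "I", "J"]
print(sum(graph_based[t][0] for t in traditional),
      sum(lexical_syntactic[t][0] for t in traditional),
      sum(totals_graph[t] for t in traditional))
```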

SLIDE 13


Concluding remarks

1 Lessons learned

  • The lexical-syntactic features helped to represent the writing style.
  • The graph-based representation performed better than the lexical-syntactic one. However, more investigation of the graph representation is still required.

2 Current work

  • Other data sets and tasks
  • More lexical-syntactic features to design and use
  • Better understand the role of the graph representation
  • Experiment with different graph-based text representations that allow us to obtain much more complex patterns

SLIDE 14


Thank you for your attention!
