Text Classification using Weka


SLIDE 1

Language Technology I - An Introduction to Text Classification - WS 2014/2015

Text Classification using Weka

Jörg Steffen, DFKI
Substitute: Günter Neumann, DFKI
steffen@dfki.de
10.11.2014

SLIDE 2


What is Weka?

Workbench for machine learning and data mining
Supports a large number of ML approaches
Developed by the ML group at the University of Waikato (NZ)

Implemented in Java
Open Source software under GNU GPL
http://www.cs.waikato.ac.nz/~ml/weka/index.html

SLIDE 3


Weka Datasets

Used for training and testing
Collection of examples

attributes with values

Represented as ARFF file

ARFF: attribute-relation file format
header with attribute types

nominal: finite set of strings
numeric
string
date

example instances as comma-separated list of attribute values

SLIDE 4


ARFF Example

% Header
@relation golf_weather
@attribute outlook {sunny, overcast, rainy}
@attribute temperature numeric
@attribute humidity numeric
@attribute windy {true, false}
@attribute playGolf {yes, no}

% Instances
@data
sunny, 29, 85, false, no
sunny, 27, 90, true, no
overcast, 28, 86, false, yes
rainy, 21, 96, false, yes
rainy, 20, 80, false, yes
rainy, 18, 70, true, no
overcast, 17, 65, true, yes
sunny, 22, 95, false, no
sunny, 21, 70, false, yes
rainy, 21, 80, false, yes
sunny, 24, 70, true, yes
overcast, 22, 90, true, yes
overcast, 27, 75, false, yes
rainy, 22, 91, true, no
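The header/instances split can be made concrete with a small reader. This is a minimal sketch in Python, not Weka's own loader: `parse_arff` is a hypothetical helper that assumes only nominal and numeric attributes, unquoted values, and dense comma-separated instances.

```python
def parse_arff(text):
    """Parse a small ARFF document into (relation, attributes, rows).

    Assumption: only nominal/numeric attributes and unquoted values occur.
    """
    relation = None
    attributes = []  # (name, "numeric") or (name, [nominal labels])
    rows = []
    in_data = False
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("%"):  # skip blank lines and comments
            continue
        lower = line.lower()
        if in_data:
            values = [v.strip() for v in line.split(",")]
            row = {}
            for (name, kind), value in zip(attributes, values):
                row[name] = float(value) if kind == "numeric" else value
            rows.append(row)
        elif lower.startswith("@relation"):
            relation = line.split(None, 1)[1]
        elif lower.startswith("@attribute"):
            _, name, spec = line.split(None, 2)
            if spec.startswith("{"):
                kind = [label.strip() for label in spec.strip("{}").split(",")]
            else:
                kind = spec.lower()
            attributes.append((name, kind))
        elif lower.startswith("@data"):
            in_data = True
    return relation, attributes, rows


SAMPLE = """\
@relation golf_weather
@attribute outlook {sunny, overcast, rainy}
@attribute temperature numeric
@attribute humidity numeric
@attribute windy {true, false}
@attribute playGolf {yes, no}
@data
sunny, 29, 85, false, no
overcast, 28, 86, false, yes
rainy, 18, 70, true, no
"""

relation, attributes, rows = parse_arff(SAMPLE)
print(relation, len(attributes), len(rows))
```

Numeric values are converted to floats; nominal values stay as strings and are not checked against the declared label set.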

SLIDE 5


J48 Decision Tree

> java -cp weka-3.6.3.jar weka.classifiers.trees.J48 -t weather.arff -i

J48 pruned tree

outlook = sunny
|   humidity <= 75: yes (2.0)
|   humidity > 75: no (3.0)
outlook = overcast: yes (4.0)
outlook = rainy
|   windy = true: no (2.0)
|   windy = false: yes (3.0)

Number of Leaves : 5
Size of the tree : 8

=== Error on training data ===
Correctly Classified Instances 14 100 %
Incorrectly Classified Instances 0 0 %
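The pruned tree is small enough to transcribe by hand. The sketch below (Python, independent of Weka) writes it as a function and re-checks it against the 14 instances of the ARFF example, reading the outlook values as sunny, overcast, and rainy as declared in the header. The name `classify` is illustrative only.

```python
def classify(outlook, humidity, windy):
    """Apply the pruned J48 tree shown above to a single instance."""
    if outlook == "sunny":
        return "yes" if humidity <= 75 else "no"
    if outlook == "overcast":
        return "yes"
    # Remaining branch: outlook == "rainy"
    return "no" if windy else "yes"


# (outlook, temperature, humidity, windy, playGolf)
DATA = [
    ("sunny", 29, 85, False, "no"),
    ("sunny", 27, 90, True, "no"),
    ("overcast", 28, 86, False, "yes"),
    ("rainy", 21, 96, False, "yes"),
    ("rainy", 20, 80, False, "yes"),
    ("rainy", 18, 70, True, "no"),
    ("overcast", 17, 65, True, "yes"),
    ("sunny", 22, 95, False, "no"),
    ("sunny", 21, 70, False, "yes"),
    ("rainy", 21, 80, False, "yes"),
    ("sunny", 24, 70, True, "yes"),
    ("overcast", 22, 90, True, "yes"),
    ("overcast", 27, 75, False, "yes"),
    ("rainy", 22, 91, True, "no"),
]

correct = sum(classify(o, h, w) == label for o, _t, h, w, label in DATA)
print(f"{correct} of {len(DATA)} training instances classified correctly")
```

Temperature is never tested by the tree, which is why the function ignores it. The 100 % figure is an error rate on the training data, so it says nothing about performance on unseen instances.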

SLIDE 6


Vector-Based Text Classification

Document features as numeric Weka attributes
Feature weight as attribute values
Document class as last Weka attribute
Example instances as feature vectors followed by document class

@attribute 'I' numeric
@attribute 'walk' numeric
@attribute 'drive' numeric
@attribute moving_type {walking, driving}
@data
1,1,0,walking
1,0,1,driving
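A possible way to produce such a file from tokenized documents is sketched below in Python. The helpers `to_feature_vector` and `to_arff` and the relation name `moving` are hypothetical. Raw term counts are used as feature weights, and whitespace tokenization is assumed.

```python
def to_feature_vector(tokens, vocabulary):
    """Count how often each vocabulary term occurs in the token list."""
    return [tokens.count(term) for term in vocabulary]


def to_arff(relation, vocabulary, class_name, classes, documents):
    """Render (tokens, label) pairs as ARFF text, class attribute last.

    Assumption: terms contain no quote characters, so no escaping is done.
    """
    lines = ["@relation " + relation]
    lines += ["@attribute '" + term + "' numeric" for term in vocabulary]
    lines.append("@attribute " + class_name + " {" + ", ".join(classes) + "}")
    lines.append("@data")
    for tokens, label in documents:
        vector = to_feature_vector(tokens, vocabulary)
        lines.append(",".join(str(v) for v in vector) + "," + label)
    return "\n".join(lines)


documents = [("I walk".split(), "walking"), ("I drive".split(), "driving")]
arff = to_arff("moving", ["I", "walk", "drive"], "moving_type",
               ["walking", "driving"], documents)
print(arff)
```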

SLIDE 7


Language Identification

Classes: 12 languages

German (de), Italian (it), Catalan (ca), Norwegian (no), Finnish (fi), Danish (dk),
Sorbian (sb), Swedish (sv), French (fr), English (en), Estonian (et), Dutch (nl)

http://corpora.uni-leipzig.de/download.html
Features: character unigrams and bigrams
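Character unigram and bigram extraction can be sketched in a few lines of Python. The slides do not say how whitespace or case is handled. This sketch keeps both unchanged, which is an assumption, and `char_ngrams` is a hypothetical helper name.

```python
from collections import Counter


def char_ngrams(text, sizes=(1, 2)):
    """Count all character n-grams of the given sizes in the text."""
    counts = Counter()
    for n in sizes:
        for i in range(len(text) - n + 1):
            counts[text[i:i + n]] += 1
    return counts


features = char_ngrams("abab")
print(sorted(features.items()))
```

A string of length k yields k unigrams and k - 1 bigrams, so the feature count per sentence grows linearly with sentence length.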

SLIDE 8


Language Identification

Training data: 1000 sentences per language

train.arff

Test data: 500 sentences per language

test.arff

Feature selection using corpus frequency >= 4

4764 total features, 1845 filtered out, 2919 features left

Feature weight: tf.idf
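The slides do not state which tf.idf variant was used or how corpus frequency was counted. The Python sketch below assumes corpus frequency is the total count of a feature over all documents and uses the common weighting tf * ln(N / df). Both function names are hypothetical.

```python
import math
from collections import Counter


def select_features(docs, min_corpus_freq=4):
    """Keep features whose total count over all documents reaches the threshold."""
    corpus_freq = Counter()
    for counts in docs:
        corpus_freq.update(counts)
    return sorted(f for f, c in corpus_freq.items() if c >= min_corpus_freq)


def tfidf_vectors(docs, features):
    """Weight each kept feature by tf * ln(N / df); assumed variant."""
    n_docs = len(docs)
    doc_freq = Counter()
    for counts in docs:
        doc_freq.update(f for f in features if counts.get(f, 0) > 0)
    vectors = []
    for counts in docs:
        vectors.append([
            counts.get(f, 0) * math.log(n_docs / doc_freq[f]) if doc_freq[f] else 0.0
            for f in features
        ])
    return vectors


# Toy documents given as per-document feature counts.
docs = [Counter("aab"), Counter("abc"), Counter("ccd")]
kept = select_features(docs, min_corpus_freq=3)
vectors = tfidf_vectors(docs, kept)
print(kept, vectors)
```

With this weighting, a feature that occurs in every document gets weight 0, since ln(N / N) = 0.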

SLIDE 9


Language Identification ARFF File

...
@attribute 'Ru' numeric
@attribute 'Ry' numeric
@attribute 'Rà' numeric
@attribute 'Rä' numeric
@attribute 'Rå' numeric
@attribute 'Ré' numeric
...
@attribute lang {de,it,ca,no,fi,dk,sb,sv,fr,en,et,nl}
@data
...
0,0,14.2323,0,0,7.456, ..., de
...

SLIDE 10


Language Identification Results

> java -Xms2048m -Xmx2048m -Dfile.encoding=utf-8 -cp weka-3.6.3.jar \
    weka.classifiers.bayes.NaiveBayes -t train.arff -T test.arff

Time taken to build model: 9.57 seconds
Time taken to test model on training data: 101.29 seconds

=== Error on test data ===
Correctly Classified Instances 5514 91.9 %
Incorrectly Classified Instances 486 8.1 %
...
Total Number of Instances 6000

=== Confusion Matrix ===
   a   b   c   d   e   f   g   h   i   j   k   l   <-- classified as
 479   0   1   3   0   0   3   3   0   3   0   8 |   a = de
   0 479   5   4   0   1   6   1   0   4   0   0 |   b = it
   9   6 445   3   0   0   5   6   8   6   0  12 |   c = ca
  12   0   3 388   0  72   1  17   0   2   0   5 |   d = no
   2   1   0   2 487   0   0   4   0   0   3   1 |   e = fi
   4   1   2  73   1 393   0   8   0   9   1   8 |   f = dk
   3   0   0   1   1   1 492   0   0   1   1   0 |   g = sb
   6   0   0  11   1  10   0 461   0   8   0   3 |   h = sv
   3   0  13   5   0   0   2   1 453   4   0  19 |   i = fr
   3   0   1   4   0   2   3   2   0 464   0  21 |   j = en
   1   0   0   1   1   0   2   1   1   2 489   2 |   k = et
   7   0   0   1   0   0   1   1   2   4   0 484 |   l = nl
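The summary figures follow directly from the confusion matrix, where each row is the actual language and each column the predicted one. The Python sketch below recomputes accuracy and per-class recall from the NaiveBayes matrix. The function names are illustrative and not part of Weka.

```python
LABELS = ["de", "it", "ca", "no", "fi", "dk", "sb", "sv", "fr", "en", "et", "nl"]

# NaiveBayes confusion matrix from the output above (rows: actual class).
NB = [
    [479, 0, 1, 3, 0, 0, 3, 3, 0, 3, 0, 8],
    [0, 479, 5, 4, 0, 1, 6, 1, 0, 4, 0, 0],
    [9, 6, 445, 3, 0, 0, 5, 6, 8, 6, 0, 12],
    [12, 0, 3, 388, 0, 72, 1, 17, 0, 2, 0, 5],
    [2, 1, 0, 2, 487, 0, 0, 4, 0, 0, 3, 1],
    [4, 1, 2, 73, 1, 393, 0, 8, 0, 9, 1, 8],
    [3, 0, 0, 1, 1, 1, 492, 0, 0, 1, 1, 0],
    [6, 0, 0, 11, 1, 10, 0, 461, 0, 8, 0, 3],
    [3, 0, 13, 5, 0, 0, 2, 1, 453, 4, 0, 19],
    [3, 0, 1, 4, 0, 2, 3, 2, 0, 464, 0, 21],
    [1, 0, 0, 1, 1, 0, 2, 1, 1, 2, 489, 2],
    [7, 0, 0, 1, 0, 0, 1, 1, 2, 4, 0, 484],
]


def accuracy(matrix):
    """Fraction of instances on the diagonal (correctly classified)."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    return correct / sum(sum(row) for row in matrix)


def recall_per_class(matrix, labels):
    """Diagonal entry divided by the row sum, for each actual class."""
    return {labels[i]: matrix[i][i] / sum(matrix[i]) for i in range(len(matrix))}


nb_accuracy = accuracy(NB)
nb_recall = recall_per_class(NB, LABELS)
print(round(nb_accuracy, 4), min(nb_recall, key=nb_recall.get))
```

Norwegian and Danish have the lowest recall, which matches the large off-diagonal counts between rows d and f.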

SLIDE 11


Language Identification Results

> java -Xms2048m -Xmx2048m -Dfile.encoding=utf-8 -cp weka-3.6.3.jar \
    weka.classifiers.functions.SMO -t train.arff -T test.arff

Time taken to build model: 94.77 seconds
Time taken to test model on training data: 23.07 seconds

=== Error on test data ===
Correctly Classified Instances 5703 95.05 %
Incorrectly Classified Instances 297 4.95 %
...
Total Number of Instances 6000

=== Confusion Matrix ===
   a   b   c   d   e   f   g   h   i   j   k   l   <-- classified as
 497   0   0   2   0   0   1   0   0   0   0   0 |   a = de
   0 490   6   0   0   1   0   0   2   1   0   0 |   b = it
   0   8 486   1   0   1   0   1   2   1   0   0 |   c = ca
   9   3   1 431   1  43   0   8   1   2   0   1 |   d = no
   1   1   0   2 492   0   0   3   0   0   1   0 |   e = fi
   4   1   1  84   0 402   0   5   0   1   0   2 |   f = dk
   3   4   1   2   0   1 483   1   1   0   4   0 |   g = sb
   4   1   4  15   0   5   0 468   1   1   1   0 |   h = sv
   0   2   2   0   0   0   0   0 492   2   0   2 |   i = fr
   1   2   6   2   0   0   0   1   3 485   0   0 |   j = en
   1   0   1   0   2   0   0   0   0   0 496   0 |   k = et
   4   1   1   1   0   2   0   0   6   4   0 481 |   l = nl
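To see where the remaining SMO errors concentrate, the off-diagonal cells can be ranked. The Python sketch below lists the most frequent (actual, predicted) confusions in the SMO matrix. `top_confusions` is a hypothetical helper, not a Weka function.

```python
LABELS = ["de", "it", "ca", "no", "fi", "dk", "sb", "sv", "fr", "en", "et", "nl"]

# SMO confusion matrix from the output above (rows: actual class).
SMO = [
    [497, 0, 0, 2, 0, 0, 1, 0, 0, 0, 0, 0],
    [0, 490, 6, 0, 0, 1, 0, 0, 2, 1, 0, 0],
    [0, 8, 486, 1, 0, 1, 0, 1, 2, 1, 0, 0],
    [9, 3, 1, 431, 1, 43, 0, 8, 1, 2, 0, 1],
    [1, 1, 0, 2, 492, 0, 0, 3, 0, 0, 1, 0],
    [4, 1, 1, 84, 0, 402, 0, 5, 0, 1, 0, 2],
    [3, 4, 1, 2, 0, 1, 483, 1, 1, 0, 4, 0],
    [4, 1, 4, 15, 0, 5, 0, 468, 1, 1, 1, 0],
    [0, 2, 2, 0, 0, 0, 0, 0, 492, 2, 0, 2],
    [1, 2, 6, 2, 0, 0, 0, 1, 3, 485, 0, 0],
    [1, 0, 1, 0, 2, 0, 0, 0, 0, 0, 496, 0],
    [4, 1, 1, 1, 0, 2, 0, 0, 6, 4, 0, 481],
]


def top_confusions(matrix, labels, k=3):
    """Return the k largest off-diagonal cells as (count, actual, predicted)."""
    pairs = [
        (matrix[i][j], labels[i], labels[j])
        for i in range(len(matrix))
        for j in range(len(matrix))
        if i != j
    ]
    pairs.sort(reverse=True)
    return pairs[:k]


worst = top_confusions(SMO, LABELS)
total_errors = sum(sum(row) for row in SMO) - sum(SMO[i][i] for i in range(12))
print(worst, total_errors)
```

The two directions of the Norwegian/Danish confusion together make up a large share of all misclassifications, so features that separate these two languages would be the natural place to look for further gains.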