Automated Detection of Plagiarism based on Whitespace and History



SLIDE 1

Chair of Network Architectures and Services, Department of Informatics, Technical University of Munich

Automated Detection of Plagiarism based on Whitespace and History

Markus Ongyerth

December 4, 2017

SLIDE 2

Contents

  • The Idea
  • Implementation
  • Evaluation
  • Further Work

M. Ongyerth, gitplag

SLIDE 3

What we want to find

struct icmp6_neighbor_solicit ␣ {
    struct ether_header ehdr;
    struct ipv6_hdr iphdr;
    struct neighbor_solicit_payload pay;
} __attribute__ (( packed ));

if(buffer[offset + 6] != (byte) 0b00111010 ){
    return false;
}
if(buffer[offset + 7] != (byte) 0b11111111 ){
    return false;
}

SLIDE 5

The GRNVS dataset

Year                  2016   2016   2017   2017
Assignment               3      4      2      3
Submissions            236    199    355    223
Avg. commits            29     15     27     42
Cases of plagiarism      8      4      4      1

Automatic tests are triggered over git.

SLIDE 7

In the past

  • Check for plagiarism with MOSS
  • Check the results by hand
  • Search for “strong” evidence by hand

SLIDE 8

Our two approaches

Whitespace errors

  • Weird/broken indentation
  • Multiple consecutive spaces
  • Trailing whitespace
  • Examples (→ = tab, ␣ = space, ^ = start of line, $ = end of line):
      ^→␣→
      →struct␣␣struct;
      →struct␣struct;␣$

Identifiers

  • Unintuitive names
  • Copies of typos
  • Examples: 0b1000001, numericToTextFormat, java.sql.Time
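
A minimal sketch of how the two kinds of evidence listed on this slide could be turned into tokens. It is not gitplag's actual tokenizer: the function names, the regular expressions, and the rendering of tabs as → and spaces as ␣ are assumptions made for illustration.

```python
import re


def _visible(ws: str) -> str:
    """Render tabs and spaces with the visible symbols used on the slide."""
    return ws.replace("\t", "→").replace(" ", "␣")


def whitespace_tokens(line: str) -> list[str]:
    """Return one token per whitespace anomaly found in a single line.

    Hypothetical rules (assumptions, not gitplag's real ones):
    mixed tab/space indentation, runs of two or more spaces, trailing whitespace.
    """
    text = line.rstrip("\n")
    tokens = []

    # Indentation that mixes tabs and spaces, e.g. "^→␣→".
    indent = re.match(r"[ \t]*", text).group()
    if " " in indent and "\t" in indent:
        tokens.append("^" + _visible(indent))

    # Two or more spaces between two words, e.g. "struct␣␣foo;".
    for m in re.finditer(r"(\S+)( {2,})(\S+)", text):
        tokens.append(m.group(1) + _visible(m.group(2)) + m.group(3))

    # Whitespace between the last word and the end of the line, e.g. "foo;␣$".
    trailing = re.search(r"(\S+)([ \t]+)$", text)
    if trailing:
        tokens.append(trailing.group(1) + _visible(trailing.group(2)) + "$")
    return tokens


# Binary literals and (possibly dotted) names; an illustrative pattern only.
_IDENTIFIER = re.compile(r"0[bB][01]+|[A-Za-z_]\w*(?:\.[A-Za-z_]\w*)*")


def identifier_tokens(source: str, common: frozenset = frozenset()) -> set[str]:
    """Collect identifier-like tokens, dropping names listed in `common`."""
    return {name for name in _IDENTIFIER.findall(source) if name not in common}
```

Calling `whitespace_tokens` on every line of a submission and `identifier_tokens` on the whole file gives two token sets per submission that can later be compared across students.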

SLIDE 9

Version control history

  • Perpetrators try to hide their copying
  • They “destroy” evidence
  • The Git history preserves the evidence
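
The point about history can be illustrated with a small sketch: evidence that was cleaned out of the final version is still found when every committed version is scanned. The helper names are hypothetical, and the snapshots are passed in as plain strings; in practice they might be obtained with something like `git show <commit>:<path>`, which is an assumption about the workflow rather than a description of gitplag.

```python
def trailing_whitespace_lines(source: str) -> set[str]:
    """Non-blank lines that end in spaces or tabs, kept verbatim as evidence."""
    return {
        line
        for line in source.split("\n")
        if line.strip() and line != line.rstrip(" \t")
    }


def evidence_over_history(versions: list[str]) -> set[str]:
    """Union of the evidence found in every committed version of a file."""
    evidence: set[str] = set()
    for snapshot in versions:
        evidence |= trailing_whitespace_lines(snapshot)
    return evidence


# Oldest version first; the second commit removes the trailing space.
history = ["int a; \nint b;\n", "int a;\nint b;\n"]
final_only = trailing_whitespace_lines(history[-1])
with_history = evidence_over_history(history)
```

Here `final_only` is empty, while `with_history` still contains the line with the trailing space from the first commit.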

SLIDE 10

Implementation

  1. Read and tokenize submissions
  2. Filter to viable tokens
  3. Compare submissions pairwise
  4. Generate report / provide interactive interface
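
The steps above can be sketched as follows, starting from already tokenized submissions. The slides do not define "viable", so this sketch assumes a token is viable when it occurs in at most `max_owners` submissions; that definition, the `min_shared` threshold, and all function names are assumptions for illustration, not gitplag's implementation.

```python
from collections import Counter
from itertools import combinations


def viable_tokens(submissions: dict[str, set[str]], max_owners: int) -> dict[str, set[str]]:
    """Step 2: keep only tokens that appear in at most `max_owners` submissions.

    ASSUMPTION: rarity is used as the viability criterion in this sketch.
    """
    owners = Counter(tok for toks in submissions.values() for tok in toks)
    return {
        name: {tok for tok in toks if owners[tok] <= max_owners}
        for name, toks in submissions.items()
    }


def compare_pairwise(submissions: dict[str, set[str]], min_shared: int) -> list[tuple[str, str, list[str]]]:
    """Steps 3 and 4: report every pair sharing at least `min_shared` tokens."""
    report = []
    for a, b in combinations(sorted(submissions), 2):
        shared = submissions[a] & submissions[b]
        if len(shared) >= min_shared:
            report.append((a, b, sorted(shared)))
    # Most suspicious pairs (largest overlap) first.
    return sorted(report, key=lambda row: -len(row[2]))


# Step 1 is assumed to have produced these token sets already.
tokenized = {
    "alice": {"x␣␣y", "foo;␣$", "common"},
    "bob": {"x␣␣y", "foo;␣$", "common"},
    "carol": {"common", "z"},
    "dave": {"common"},
}
filtered = viable_tokens(tokenized, max_owners=2)
report = compare_pairwise(filtered, min_shared=2)
```

With these inputs the token `common` is filtered out because all four submissions contain it, and only the pair alice/bob remains in the report.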

SLIDE 11

Differences to other systems

  • Whitespace (usually ignored by other systems)
  • Identifiers (usually ignored by other systems)
  • History (usually not available to other systems)

SLIDE 14

ROC graphs

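[Figure: ROC plot. x-axis: FPF (false positive fraction), 0 to 1; y-axis: sensitivity, 0 to 1. The region above the diagonal is labeled "Better than guessing", the region below it "Worse than guessing".]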
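
As a worked example of the two axes, the following sketch computes one ROC point from a set of flagged pairs and a set of known plagiarism cases. The function name and the representation of a pair as a `frozenset` of two submission names are assumptions for illustration.

```python
def roc_point(flagged: set[frozenset], plagiarised: set[frozenset], total_pairs: int) -> tuple[float, float]:
    """Return (FPF, sensitivity) for one detector setting.

    sensitivity = true positives / all real plagiarism pairs
    FPF         = false positives / all innocent pairs
    """
    true_positives = len(flagged & plagiarised)
    false_positives = len(flagged - plagiarised)
    innocent_pairs = total_pairs - len(plagiarised)
    return false_positives / innocent_pairs, true_positives / len(plagiarised)


# Four submissions give 4 * 3 / 2 = 6 pairs; two of them are real cases.
known = {frozenset({"a", "b"}), frozenset({"c", "d"})}
flagged = {frozenset({"a", "b"}), frozenset({"a", "c"})}
fpf, sensitivity = roc_point(flagged, known, total_pairs=6)
```

One of the two real cases is found (sensitivity 0.5) and one of the four innocent pairs is flagged (FPF 0.25), so this point lies above the diagonal.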

SLIDE 15

Detection rate (2016 whitespace with git)

[Figure: two ROC plots. x-axis: FPF, 0 to 3·10^-2; y-axis: sensitivity, 0 to 1. Series: All, Viability=5, Viability=15, Viability=30.
(a) Assignment 2, labeled points: 15,2; 20,2; 30,2; 5,2.
(b) Assignment 3, labeled points: 10,2; 5,2.]

SLIDE 16

Detection rate (2017 identifier)

[Figure: two ROC plots. x-axis: FPF, 0 to 3·10^-3; y-axis: sensitivity, 0 to 1. Series: Identifier, With Git.
(c) Assignment 2, labeled points: 15, 20, 30, 5, 10, 30, 5.
(d) Assignment 3, labeled points: 15, 30, 5, 30, 5.]

SLIDE 17

It’s not perfect

  • Shared external file
  • Students worked together
  • Incomplete file filter

SLIDE 18

Time requirements

Assignment   Git   Whitespace   Identifier
3            No    8 s          10 s
3            Yes   18 s         24 s
4            No    3 s          4 s
4            Yes   6 s          9 s

SLIDE 19

Further work

  • Improve detection of usable files
  • Create and evaluate other tokenizing mechanisms
  • Some implementation details

SLIDE 20

Related work

  • MOSS
  • Gitplag
  • (Docoloc)
  • Measuring Whitespace Pattern Sequences as an Indication of Plagiarism (Baer et al.)
