Chair of Network Architectures and Services Department of Informatics Technical University of Munich
Automated Detection of Plagiarism based on Whitespace and History
Markus Ongyerth
December 4, 2017
Contents
- The Idea
- Implementation
- Evaluation
- Further Work
- M. Ongyerth
– gitplag 2
What we want to find
    struct icmp6_neighbor_solicit ␣ {
        struct ether_header ehdr;
        struct ipv6_hdr iphdr;
        struct neighbor_solicit_payload pay;
    } __attribute__ (( packed ));

    if(buffer[offset + 6] != (byte) 0b00111010 ){
        return false;
    }
    if(buffer[offset + 7] != (byte) 0b11111111 ){
        return false;
    }
The GRNVS dataset
Year                 2016  2016  2017  2017
Assignment           3     4     2     3
Submissions          236   199   355   223
Avg. commits         29    15    27    42
Cases of plagiarism  8     4     4     1

- Automatic tests triggered over git
In the past
- Check for plagiarism with MOSS
- Check the results by hand
- Search for “strong” evidence by hand
Our two approaches
Whitespace errors (→ = tab, ␣ = space, ^ = start of line, $ = end of line)
- Weird/broken indentation: ^→␣→
- Multiple spaces: →struct␣␣struct;
- Trailing whitespace: →struct␣struct;␣$
Identifiers
- Unintuitive names
- Copies of typos
- Examples: 0b1000001, numericToTextFormat, java.sql.Time
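A minimal sketch of how whitespace errors like these could be turned into comparable tokens. The function name `whitespace_tokens`, the three token kinds, and the rule that an indent mixing tabs and spaces counts as broken are assumptions for illustration, not gitplag's actual tokenizer.

```python
import re

def whitespace_tokens(line):
    """Return (kind, text) tokens for whitespace oddities in one source line."""
    body = line.rstrip("\r\n")   # drop the line terminator only
    code = body.rstrip(" \t")    # the line without trailing whitespace
    tokens = []

    # Broken indentation: the leading run contains both tabs and spaces.
    indent = re.match(r"[ \t]*", code).group(0)
    if " " in indent and "\t" in indent:
        tokens.append(("mixed-indent", indent))

    # Multiple spaces: two or more consecutive spaces after the indent.
    for match in re.finditer(r" {2,}", code[len(indent):]):
        tokens.append(("multi-space", match.group(0)))

    # Trailing whitespace: anything removed by the rstrip above.
    if len(code) < len(body):
        tokens.append(("trailing", body[len(code):]))
    return tokens

# Hypothetical input lines, one per error class.
print(whitespace_tokens("\t \tint x;"))
print(whitespace_tokens("\tstruct  foo bar;"))
print(whitespace_tokens("\tstruct foo bar; \n"))
```

Two submissions that share the same unusual token at the same place would then be more suspicious than two that merely share correct, conventional formatting.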
Version control history
- Perpetrators try to hide
- They “destroy” evidence
- The (Git) history preserves evidence
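The effect of keeping the history can be sketched as follows: if every revision is considered, a token that was cleaned up later still counts as evidence. The slides do not say how gitplag walks the history, so the representation (one set of tokens per revision, oldest first) and both helper names are assumptions for illustration.

```python
def shared_at_head(revisions_a, revisions_b):
    """Tokens shared by the latest revisions only (no history)."""
    return set(revisions_a[-1]) & set(revisions_b[-1])

def shared_over_history(revisions_a, revisions_b):
    """Tokens that occur in any revision of A and in any revision of B."""
    seen_a = set().union(*revisions_a)
    seen_b = set().union(*revisions_b)
    return seen_a & seen_b

# Hypothetical example: the copier commits the copy, then renames everything.
original = [{"numericToTextFormat", "pay"}]
copier = [{"numericToTextFormat", "pay"}, {"toText", "payload"}]

print(shared_at_head(original, copier))       # nothing left in the final state
print(shared_over_history(original, copier))  # the first commit still matches
```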
Implementation
1. Read and tokenize submissions
2. Filter to viable tokens
3. Compare submissions pairwise
4. Generate report / provide interactive interface
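The four steps can be sketched end to end as below. This is an illustration under stated assumptions, not gitplag's code: the tokenizer is a simple identifier regex, "viable" is interpreted as "occurs in at most `max_owners` submissions", and all function names and sample submissions are hypothetical.

```python
import re
from collections import Counter
from itertools import combinations

def tokenize(source):
    """Step 1: split a submission into identifier tokens (simplified)."""
    return set(re.findall(r"[A-Za-z_][A-Za-z0-9_.]*", source))

def viable_tokens(token_sets, max_owners=2):
    """Step 2: keep tokens that occur in at most max_owners submissions."""
    owners = Counter(tok for toks in token_sets.values() for tok in toks)
    return {tok for tok, count in owners.items() if count <= max_owners}

def compare_pairwise(token_sets, viable):
    """Step 3: collect the shared viable tokens for every pair of submissions."""
    scores = {}
    for a, b in combinations(sorted(token_sets), 2):
        shared = token_sets[a] & token_sets[b] & viable
        if shared:
            scores[(a, b)] = sorted(shared)
    return scores

def report(scores):
    """Step 4: list suspicious pairs, most shared tokens first."""
    return sorted(scores.items(), key=lambda item: -len(item[1]))

# Hypothetical submissions: alice and bob share an unusual identifier.
submissions = {
    "alice": "int numericToTextFormat(int pay) { return pay; }",
    "bob": "int numericToTextFormat(int pay) { return pay; }",
    "carol": "int to_text(int payload) { return payload; }",
}
token_sets = {name: tokenize(src) for name, src in submissions.items()}
result = report(compare_pairwise(token_sets, viable_tokens(token_sets)))
print(result)
```

Common tokens such as `int` and `return` appear in all three submissions and are filtered out in step 2, so only the rare identifiers contribute to a match.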
Differences to other systems
- Whitespace ⇐ usually ignored
- Identifiers ⇐ usually ignored
- History ⇐ usually not available
ROC graphs
[Figure: ROC graph. x-axis: FPF (false positive fraction), y-axis: Sensitivity, both from 0 to 1. One region of the plot is labelled "Better than guessing", the other "Worse than guessing".]
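One point of a ROC graph can be computed from the flagged pairs and the known plagiarism cases, using the standard definitions sensitivity = TP / (TP + FN) and FPF = FP / (FP + TN). The sketch below assumes pairs are represented as tuples; the function name `roc_point` and the sample data are hypothetical.

```python
from itertools import combinations

def roc_point(flagged, plagiarized, all_pairs):
    """Return (FPF, sensitivity) for one detector setting."""
    true_positives = len(flagged & plagiarized)
    false_positives = len(flagged - plagiarized)
    guilty = len(plagiarized)                # TP + FN
    innocent = len(all_pairs - plagiarized)  # FP + TN
    sensitivity = true_positives / guilty if guilty else 0.0
    fpf = false_positives / innocent if innocent else 0.0
    return fpf, sensitivity

# Hypothetical example: 5 submissions give 10 pairs, 2 of them plagiarized.
all_pairs = set(combinations("abcde", 2))
plagiarized = {("a", "b"), ("c", "d")}
flagged = {("a", "b"), ("a", "c")}  # one hit, one false alarm

print(roc_point(flagged, plagiarized, all_pairs))
```

Sweeping the detector's threshold and plotting one such point per setting yields the curve; points above the diagonal FPF = sensitivity are better than guessing.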
Detection rate (2016 whitespace with git)
[Figure: ROC curves, Sensitivity (0 to 1) over FPF (0 to 3·10⁻²), with the series All, Viability=5, Viability=15 and Viability=30. Panels: (a) Assignment 2, (b) Assignment 3.]
Detection rate (2017 identifier)
[Figure: ROC curves, Sensitivity (0 to 1) over FPF (0 to 3·10⁻³), with the series Identifier and With Git. Panels: (c) Assignment 2, (d) Assignment 3.]
It’s not perfect
- Shared external file
- Students worked together
- Incomplete file filter
Time requirements
Assignment  Git  Whitespace  Identifier
3           No   8s          10s
3           Yes  18s         24s
4           No   3s          4s
4           Yes  6s          9s
Further work
- Improve usable file detection
- Create and evaluate other tokenizing mechanisms
- Some implementation details
Related work
- Moss
- Gitplag
- (Docoloc)
- Measuring Whitespace Pattern Sequences as an Indication of Plagiarism (Baer et al.)