Kong Leilei, Qi Haoliang, Du Cuixia, Wang Mingxing, Han Zhongyuan
www.hljit.edu.cn PAN@CLEF2013
Approaches for Source Retrieval and Text Alignment of Plagiarism Detection
Who are we?
Heilongjiang Institute of Technology Harbin, Heilongjiang Province, China
Heilongjiang Institute of Technology, Kong Leilei PAN@CLEF2013
Approaches for Source Retrieval
Approaches for Text Alignment
Further works
[Figure: overview of plagiarism detection. Labels: Suspicious document, Query keywords, Source Retrieval, Internet Resource, Document Set, Candidate Documents, Text Alignment, Suspicious plagiarism text]
Source Retrieval
Two core problems of source retrieval
The retrieval source is millions of documents
This work was done by PAN
The query keywords of the suspicious document
How to extract query keywords is one of the important ...
Query Keywords Extraction Based on TF-IDF
Query Keywords Extraction Based on Weighted TF-IDF
Adjacent Query Keywords Extraction by PatTree
Combination of Queries and Execution of Retrieval
TF - term frequency, denotes the frequency of term i in ...
IDF - inverse document frequency
TF-IDF of term i is: ...
Tips: we found that the top 10 terms with the highest TF-IDF ...
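The slide's formula is not preserved in this transcript, so the following is only a minimal sketch of TF-IDF keyword ranking under stated assumptions: TF is the term count divided by document length, IDF is a smoothed logarithm, and the tokenizer, the function names `tokenize` and `top_tfidf_terms`, and the reference `corpus` are all hypothetical. It may differ from the weighting the authors actually used.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and keep runs of ASCII letters (a deliberately simple tokenizer)."""
    return re.findall(r"[a-z]+", text.lower())


def top_tfidf_terms(document, corpus, k=10):
    """Return the k terms of `document` with the highest TF-IDF score.

    Assumed variant: tf = count / document length,
    idf = log((1 + N) / (1 + df)) + 1, where N is the number of corpus documents
    and df is the number of corpus documents containing the term.
    """
    doc_tokens = tokenize(document)
    if not doc_tokens:
        return []
    counts = Counter(doc_tokens)
    corpus_vocab = [set(tokenize(d)) for d in corpus]
    n_docs = len(corpus_vocab)

    scores = {}
    for term, count in counts.items():
        df = sum(1 for vocab in corpus_vocab if term in vocab)
        idf = math.log((1 + n_docs) / (1 + df)) + 1
        scores[term] = (count / len(doc_tokens)) * idf

    # Sort by descending score; break ties alphabetically for a stable order.
    return sorted(scores, key=lambda t: (-scores[t], t))[:k]
```

With a small reference corpus, a term that is frequent in the suspicious document but absent elsewhere ranks first, while a term that occurs in every corpus document is pushed down the list.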
Weighted TF-IDF
Where weight is a weighting parameter, and we calculate ...
Tips: keyword extraction based on the weighted TF-IDF ...
Adjacent strings with high frequency are more ...
We use PatTree, an efficient data structure, to get the ...
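To make the idea of frequent adjacent strings concrete, here is a small sketch that counts adjacent word n-grams. It is not a PatTree: it swaps in a plain hash-based counter (`collections.Counter`), which yields the same frequency counts on small inputs but without the tree's efficiency. The function name `frequent_ngrams` and the `min_count` threshold are assumptions for illustration.

```python
from collections import Counter


def frequent_ngrams(tokens, n, min_count=2):
    """Return (n-gram, count) pairs for adjacent word n-grams seen at least min_count times.

    Results are ordered by descending count, then alphabetically.
    """
    grams = Counter(
        tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)
    )
    ranked = sorted(grams.items(), key=lambda kv: (-kv[1], kv[0]))
    return [(" ".join(gram), count) for gram, count in ranked if count >= min_count]
```

Calling it with n = 2, 3, 4 and 5 would give candidate strings for the n-gram queries; how the authors chose among candidates is not preserved in this transcript.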
Query   Query Keywords
1       Top 1 to 5 query keywords based on TF-IDF
2       Top 2 to 10 query keywords based on TF-IDF
3       2-Gram query keywords based on PatTree
4       3-Gram query keywords based on PatTree
5       4-Gram query keywords based on PatTree
6       4-Gram query keywords based on PatTree
7       Top 1 to 5 query keywords based on weighted TF-IDF
8       Top 6 to 10 query keywords based on weighted TF-IDF
9       5-Gram query keywords based on PatTree
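The combination step can be sketched roughly as follows, assuming the ranked keyword lists and n-gram strings have already been computed. The helpers `rank_range` and `combine_queries` are hypothetical names, and dropping exact duplicate queries is an assumption rather than something the slides state.

```python
def rank_range(ranked_terms, first, last):
    """Join the terms ranked first..last (1-indexed, inclusive) into one query string."""
    return " ".join(ranked_terms[first - 1:last])


def combine_queries(keyword_queries, ngram_queries):
    """Concatenate both query lists, dropping empty strings and exact duplicates.

    The original order is preserved so queries can be submitted in sequence.
    """
    seen = set()
    combined = []
    for query in list(keyword_queries) + list(ngram_queries):
        if query and query not in seen:
            seen.add(query)
            combined.append(query)
    return combined
```

For example, `rank_range(ranked, 1, 5)` would produce the "Top 1 to 5" query from a ranked list, and the combined list would then be sent to the search engine one query at a time.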
Workload
  Queries: 48.50
  Downloads: 5691.47
Time to 1st Detection
  Queries: 2.46
  Downloads: 285.66
Retrieval Performance
  Precision: 0.01
  Recall: 0.65
No Detection: 3
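For readers unfamiliar with the two retrieval measures, the sketch below computes plain set-based precision and recall over document identifiers. The function name and the example identifiers are hypothetical, and the official PAN evaluation may count matches differently (for instance when handling near-duplicate pages), so this is only the textbook definition.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a set of retrieved IDs against the true source IDs."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall
```

Downloading many documents per suspicious document, as in the workload figures above, tends to lower precision because most downloads are not true sources, while recall depends only on how many true sources are found.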
Text Alignment
Seeding
Match Merging
Extraction Filtering
Merging Algorithm
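This transcript preserves only the stage names, not the merging algorithm itself. Purely as a generic illustration of what merging and filtering of seed matches can look like, and not as the authors' method, the sketch below assumes seeds are (suspicious position, source position) pairs, merges a seed into the previous passage when the gap in both documents is at most `max_gap`, and then drops passages shorter than `min_length`. All names and thresholds are hypothetical.

```python
def merge_seeds(seeds, max_gap=3):
    """Greedily merge seed matches into passages.

    seeds: iterable of (susp_pos, src_pos) pairs.
    Returns passages as (susp_start, susp_end, src_start, src_end), inclusive.
    Only the most recent passage is considered for merging (a simplification).
    """
    passages = []
    for susp, src in sorted(seeds):
        if passages:
            s0, s1, r0, r1 = passages[-1]
            if 0 <= susp - s1 <= max_gap and 0 <= src - r1 <= max_gap:
                passages[-1] = (s0, susp, r0, src)
                continue
        passages.append((susp, susp, src, src))
    return passages


def filter_passages(passages, min_length=3):
    """Keep passages whose span in the suspicious document covers at least min_length positions."""
    return [p for p in passages if p[1] - p[0] + 1 >= min_length]
```

Isolated seeds survive merging as one-position passages, and the filtering step is what removes them.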
Use different methods to deal with different plagiarism
Query keywords extraction and ranking