Improved VSM Based Candidate Retrieval Model for Detecting External Textual Plagiarism

Mohannad T. Mohammed; Nasreen J. Kadhim; Abdallah A. Ibrahim

doi:10.24996/ijs.2019.60.10.20

Authors

Mohannad T. Mohammed, College of Health and Medical Technology, Middle Technical University, Baghdad, Iraq
Nasreen J. Kadhim, College of Science, University of Baghdad, Baghdad, Iraq
Abdallah A. Ibrahim, College of Science, University of Baghdad, Baghdad, Iraq

Keywords:

External Plagiarism, vector space model, TF-IDF, TF-ISF, fuzzy similarity

Abstract

Plagiarism has grown rapidly alongside the explosive growth of the Internet, where a massive volume of information is available with effortless access. This makes plagiarism, the act of taking someone else's work (ideas or even words) and presenting it as one's own, easy to commit. To ensure originality, plagiarism detection has become necessary in many areas, so that those who intend to plagiarize must instead invest considerable effort in producing work based on their own research.

This paper proposes a candidate retrieval model to improve the detection of textual plagiarism. The model adopts the vector space model (VSM) for retrieval and represents documents as vectors of average term weights, rather than vectors of term weights, which then serve as the retrieval queries. The second phase, detailed comparison, applies fuzzy semantic-based string similarity. The proposed system was evaluated on the PAN-PC-10 dataset. Because the problem statement is restricted to extrinsic plagiarism detection in English documents, experiments were performed only on the portion of the dataset dedicated to extrinsic detection and only on English documents. The candidate retrieval model was evaluated using Precision, Recall, and F-measure. The overall performance of the system was assessed using the five standard PAN measures: Precision, Recall, F-measure, Granularity, and Plagdet. The experimental results show that the proposed candidate retrieval model has a positive impact on the overall performance of the system and that the system outperforms other state-of-the-art methods. The proposed model detected about 80% of the plagiarism cases, and about 90% of its detections were correct. It detects literal plagiarism as well as cases containing paraphrasing. The performance comparison shows that the proposed system is comparable to, or outperforms, the baseline systems in terms of the five evaluation metrics.
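As an illustration of the candidate retrieval idea summarized above, the following is a minimal sketch and not the authors' implementation. It ranks source documents by cosine similarity between sparse TF-IDF vectors and a query built from a suspicious document. The abstract does not define "average term weights", so the function `average_weight_query` encodes one assumed reading: each term's weight is averaged over the sentences in which it occurs. The tokenizer, the smoothed IDF formula, the `top_k` cut-off, and all function names are likewise assumptions made for this example. The `plagdet` helper follows the usual PAN definition, F-measure divided by log2(1 + granularity).

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase and keep alphabetic runs (a deliberately simple tokenizer)."""
    return re.findall(r"[a-z]+", text.lower())


def idf_table(source_docs):
    """Smoothed inverse document frequency over the source collection."""
    n = len(source_docs)
    df = Counter()
    for doc in source_docs.values():
        df.update(set(tokenize(doc)))
    return {t: math.log((1 + n) / (1 + d)) + 1.0 for t, d in df.items()}


def tfidf_vector(tokens, idf):
    """Sparse TF-IDF vector; terms absent from the sources are dropped."""
    tf = Counter(tokens)
    return {t: c * idf[t] for t, c in tf.items() if t in idf}


def average_weight_query(text, idf):
    """ASSUMPTION: average each term's weight over the sentences containing it."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    total, seen = Counter(), Counter()
    for sentence in sentences:
        for t, w in tfidf_vector(tokenize(sentence), idf).items():
            total[t] += w
            seen[t] += 1
    return {t: total[t] / seen[t] for t in total}


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def retrieve_candidates(suspicious, source_docs, top_k=2):
    """Return up to top_k source names with non-zero similarity, best first."""
    idf = idf_table(source_docs)
    query = average_weight_query(suspicious, idf)
    scored = [(cosine(query, tfidf_vector(tokenize(doc), idf)), name)
              for name, doc in source_docs.items()]
    scored.sort(reverse=True)
    return [name for score, name in scored[:top_k] if score > 0]


def precision_recall_f(retrieved, relevant):
    """Set-based Precision, Recall, and F-measure for retrieved candidates."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    p = hits / len(retrieved) if retrieved else 0.0
    r = hits / len(relevant) if relevant else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f


def plagdet(f_measure, granularity):
    """PAN plagdet score: F-measure penalized by detection granularity (>= 1)."""
    return f_measure / math.log2(1 + granularity)


if __name__ == "__main__":
    sources = {
        "s1": "The cat sat on the mat. The cat slept.",
        "s2": "Stock markets fell sharply today.",
        "s3": "Plagiarism detection compares suspicious documents against source documents.",
    }
    suspicious = ("Detection of plagiarism compares suspicious documents "
                  "with source documents. Markets were calm.")
    print(retrieve_candidates(suspicious, sources, top_k=1))  # ['s3']
```

In this toy collection only the third source shares most of its weighted vocabulary with the suspicious text, so it ranks first. A real system would pass the retrieved candidates on to the detailed comparison phase rather than stopping at ranking.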