Improved VSM Based Candidate Retrieval Model for Detecting External Textual Plagiarism

Plagiarism has grown rapidly with the explosive growth of the Internet: the massive volume of information available with effortless use and access makes plagiarism, the process of taking someone else's work (represented by ideas, or even words) and presenting it as one's own, easy to perform. To ensure originality, plagiarism detection has become necessary in various areas, so that people who intend to plagiarize must instead invest considerable effort in producing work based on their own research. In this paper, a model for the candidate retrieval phase is proposed to improve the detection of textual plagiarism. The proposed model adopts the vector space model (VSM) as a retrieval model; it represents documents as vectors of average term weights, rather than vectors of term weights, and uses them as queries for retrieval. The detailed comparison task is the second phase, in which fuzzy semantic-based string similarity is applied. Experiments were conducted using PAN-PC-10 as the evaluation dataset. As the problem statement in this paper is restricted to detecting extrinsic plagiarism in English documents, experiments were performed on the portion dedicated to extrinsic detection and on English-language documents only. Precision, Recall, and F-measure were used as evaluation metrics for the proposed candidate retrieval model. The overall performance of the proposed system was assessed using the five standard PAN measures: Precision, Recall, F-measure, Granularity, and Plagdet. The experimental results show that the proposed candidate retrieval model has a positive impact on the overall performance of the system and that the system outperforms other state-of-the-art methods.
The results also show that the proposed model detected about 80% of the plagiarism cases and that about 90% of the detections were correct. The proposed model is able to detect literal plagiarism as well as cases containing paraphrasing. Performance comparison shows that the proposed system is either comparable to or outperforms the baseline systems in terms of the five evaluation metrics.


Background
With the explosive growth of the Internet, the massive volume of information available with effortless use and access has made it easy to take someone else's work and present it as one's own. As a result, plagiarism has grown rapidly. Plagiarism is defined as reusing someone else's work (represented by ideas, or even words) without citing the source [1]. At present, plagiarism detection is needed in various areas to ensure the originality of text, materials, and resources. A plagiarism detection tool can play a crucial role in deterring intentional plagiarism, so that people must invest considerable effort in contributing novel ideas or techniques to the academic world based on their own research [2].
Plagiarism detection (PD) is an application of Natural Language Processing (NLP) that is connected with methods from associated fields such as soft computing (SC), data mining (DM), and information retrieval (IR). The focus of PD research is discovering illegal copying of text patterns from other sources [3].
Plagiarism can be detected manually or automatically. Manually identifying plagiarism within a text is a big challenge: understanding of a text differs from person to person, and as the amount of information increases, a reader is less likely to be able to discover similarities among textual contents. Automatic plagiarism detection has therefore gained attention, as it can provide an effective and efficient solution at a lower economic cost than the use of human resources [4]. Generally, automatic plagiarism detection is classified into two standard approaches: extrinsic and intrinsic plagiarism detection. In the extrinsic approach, a suspected document is compared against a collection of sources (corpus) [5]. In the intrinsic approach, a suspicious document is analyzed to discover parts that were not written by the author of that document (based on the author's writing style), without carrying out comparisons against an external collection of sources [6].
Plagiarism can appear in many fields, such as written text (textual), source code in programming languages, design, images, video, and even music. In academia, plagiarism may be classified into two primary types: source code plagiarism and textual plagiarism [7].
Source code plagiarism can arise in different ways, such as code manipulation, reordering the code structure without modification, and language replacement [8]. Textual plagiarism, on the other hand, can be categorized into two standard types based on the plagiarist's behavior: literal and intelligent plagiarism. In literal plagiarism, plagiarists make no effort to hide the plagiarism they commit. They simply copy and paste the text from a specific source, with or without citing the original source (without clear quotation). In intelligent plagiarism, various methods may be used to hide the original work, including textual content manipulation or obfuscation. Obfuscation is mainly performed through text insertion, text shuffling, text deletion, and so on. Obfuscations range from simple to complex, including replacement with synonyms, translation, summarization, and idea adoption [9]. All of the previously mentioned types of textual plagiarism are considered mono-lingual (plagiarized from text documents involving one language), except for translation plagiarism, also known as cross-lingual plagiarism (plagiarized from text documents involving more than one language) [10].

Related works
For detecting external plagiarism in textual documents, most works consider three main stages. First, the source documents and the suspected document are preprocessed, and a reduced set of candidates that may be sources of plagiarism is retrieved. Next, a second stage compares the suspicious document in detail with each of the candidates generated by the retrieval stage, and plagiarized sentences are detected. Finally, in a task called heuristic post-processing, consecutive sentences within a given distance are grouped into sections and all the extracted section pairs are presented. Some of the works on detecting extrinsic plagiarism in texts are reviewed in what follows.
Alzahrani and Salim [11] introduced a semantic plagiarism detection technique that implemented fuzzy semantic-based string similarity. Their scheme began with preprocessing, which involved segmentation, tokenization, stop-word exclusion, and stemming. Next, for every suspected document, a list of candidate documents was retrieved by means of a similarity measurement and a shingling algorithm. A sentence-wise comparison was then performed between the suspicious document and the related candidate documents. In this step, a degree of fuzzy similarity was computed, with values ranging between 0 and 1: 0 for sentences that are wholly dissimilar and 1 for matching sentences. If the fuzzy similarity score of a pair of sentences exceeded a specific threshold, the sentences were marked as similar. Lastly, post-processing was performed, in which successive sentences were merged to form plagiarized sections.
In [4], a fuzzy semantic-based similarity model for detecting obfuscated plagiarism was presented and compared against five state-of-the-art methods. The work applied part-of-speech (POS) tags in addition to WordNet-based similarity measures for studying the semantic relatedness between words. To assess the semantic distance between short suspicious and source documents, fuzzy-based rules were introduced in which the semantic relatedness between words was implemented as a membership function of a fuzzy set. A learning method that combined a permission threshold and a variation threshold was implemented to decide on true plagiarism cases, with the aim of minimizing the number of false negatives and false positives. The proposed model and the baselines were assessed on ground-truth annotated cases taken from diverse datasets. The authors conducted extensive experiments, including a study of the impact of different segmentation approaches and different parameter settings. Compared against the baselines, the improvement of their approach was shown to be statistically significant using paired t-tests, which demonstrated the ability of the proposed model to detect plagiarism cases beyond verbatim plagiarism. Furthermore, an analysis of variance (ANOVA) test clarified the effectiveness of the different segmentation approaches applied to the proposed model.
In [12], combinations of different similarity metrics were investigated for the detection of extrinsic plagiarism, with the aim of clarifying the benefit of combining similarity measures over the common practice of using a single metric. The effect of using POS tagging in the plagiarism detection model was also analyzed. Different combinations of four single metrics (Match coefficient, Dice coefficient, Cosine similarity, and a Fuzzy-Semantic measure) were used with and without POS tag information. PAN-2014 was used as the evaluation dataset, and the PAN measures were used as evaluation metrics for analyzing and comparing the results.
In [13], A. Abdi et al. proposed an approach to plagiarism detection based on linguistic knowledge. To calculate the similarity between two sentences, they integrated three similarity measures: semantic similarity between pairs of sentences, word-order similarity, and semantic similarity between pairs of words. The impact of the three similarity measures was analyzed, and the best combination of them was selected. According to the evaluation carried out on the PAN-PC datasets, the proposed method was easy to follow and required minimal cost for processing text. The experimental results showed that the performance of the approach was competitive with other methods on the PAN-PC-10 and PAN-PC-11 datasets.
The contribution of this paper is a candidate retrieval model that improves plagiarism detection by taking into consideration how the representation of documents affects detection. A retrieval model based on the commonly used VSM method is proposed, in which documents are represented as vectors of average term weights, computed using a term-weighting scheme, and are used as queries for retrieval instead of being represented as vectors of term weights. The rest of this paper is organized as follows. Section 3 introduces the statement of the external plagiarism detection problem together with the preliminary concepts for its main stages. Section 4 describes the proposed work. Section 5 presents the performance evaluation and performance comparison of the proposed system, together with analysis and discussion of the proposed work. Finally, Section 6 presents conclusions and some directions for future work.

Preliminary concepts

Problem statement and formulation
For the problem of external textual plagiarism detection, a suspected document and a massive collection of source documents are given. To detect plagiarism in the proposed system, three tasks are performed in sequence. First, a smaller set of candidates that are the most similar to the suspected document, and that may be the source of the plagiarized content, is retrieved from the collection. Second, in the detailed comparison task, the suspected document is compared in detail against each candidate document: pairs of sentences from the two documents under comparison are extracted, and a suspicious sentence is considered plagiarized if its similarity with a sentence of the candidate is equal to or higher than a given threshold. This similarity is the fuzzy semantic-based string similarity. Finally, in a task called heuristic post-processing, consecutive sentences within a given distance are grouped into sections and all the extracted section pairs are presented. The preliminary concepts, together with the implemented and proposed models and algorithms for the stages of plagiarism detection, are described in what follows.

Candidate retrieval stage
The preprocessing steps in this stage are tokenization, punctuation elimination, lowercasing, removal of duplicate tokens, stop-word removal, stemming, and removal of duplicate stems. The result of preprocessing is the set of all distinct terms that exist in the suspicious document and in the collection of source documents. The objective of the candidate retrieval stage is to retrieve from the corpus a reduced set of sources that are relevant to the suspected document, in the sense of satisfying a global similarity to it, and that are therefore treated as candidate sources of plagiarism. This preliminary filtering is significant for reducing the number of document pairs before the exhaustive analysis phase. The source documents whose similarity score is equal to or higher than a particular threshold are considered the candidates for the detailed comparison stage. Formally, the candidate list consists of the top-ranked sources whose similarity to the suspicious document meets this threshold.

Implementing the retrieval model

In this stage, the commonly used VSM information retrieval model has been adopted for representing the suspicious document and the source documents in the collection as vectors. The elements of these vectors are then weighted according to this model. Next, a similarity measure is used to compute the similarity between each pair of documents represented as vectors. Finally, the list of document pairs and their similarity scores produced by the retrieval model is ranked in descending order of score, and the sources whose score is equal to or greater than a given threshold are retrieved and considered as candidates for the detailed comparison stage. The cut-off value was tuned by trying several values to find one with which the system performs well at retrieving candidates. The final steps of the retrieval algorithm sort the similarity list in descending order (step five), extract the sources that satisfy the cut-off (step six), and stop (step seven).
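
As an illustration of the retrieval steps described above, the following minimal Python sketch ranks source documents against a suspicious document and keeps those at or above a threshold. The text does not name the weighting scheme or the similarity measure, so a smoothed TF-IDF weight and cosine similarity are assumed here; the function names (`tfidf_vectors`, `cosine`, `retrieve_candidates`), the toy documents, and the threshold of 0.3 are all hypothetical.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Map each document name to a {term: weight} vector.

    docs: dict of name -> list of preprocessed terms.
    Assumed weighting: tf * (ln((1 + N) / (1 + df)) + 1).
    """
    n_docs = len(docs)
    df = Counter(term for terms in docs.values() for term in set(terms))
    vectors = {}
    for name, terms in docs.items():
        tf = Counter(terms)
        vectors[name] = {
            term: count * (math.log((1 + n_docs) / (1 + df[term])) + 1.0)
            for term, count in tf.items()
        }
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse {term: weight} vectors."""
    dot = sum(w * v[t] for t, w in u.items() if t in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def retrieve_candidates(suspicious_terms, sources, threshold):
    """Return (name, score) pairs at or above threshold, best first."""
    docs = dict(sources)
    docs["__suspicious__"] = suspicious_terms
    vectors = tfidf_vectors(docs)
    query = vectors.pop("__suspicious__")
    scored = [(name, cosine(query, vec)) for name, vec in vectors.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)  # descending by score
    return [(name, score) for name, score in scored if score >= threshold]


# Toy example with already-preprocessed terms (hypothetical data).
sources = {
    "a": ["cat", "sat", "mat"],
    "b": ["dog", "ran", "park"],
    "c": ["cat", "mat", "hat"],
}
candidates = retrieve_candidates(["cat", "sat", "mat"], sources, threshold=0.3)
```

In this toy run, the source sharing no terms with the suspicious document scores zero and is filtered out, which is the purpose of the preliminary filtering step.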

Detailed comparison stage
The detailed comparison is implemented by measuring the fuzzy similarity between the sentences of the pair of documents under comparison. First, the suspicious document and the set of candidate documents retrieved in the candidate retrieval stage are preprocessed. Next, the suspicious document is compared in detail with each candidate document by measuring the fuzzy semantic-based string similarity. The pair of documents is compared at the sentence level. A sentence is considered plagiarized if its similarity score is equal to or larger than a given threshold value.
For the detailed comparison, the preprocessing steps required for measuring the fuzzy similarity between the pair of documents under comparison are performed first. The suspicious document and the candidate document are segmented into individual sentences. Then, sentences that comprise three words or fewer are discarded and duplicate sentences are excluded. After that, tokenization, punctuation elimination, lowercasing, removal of duplicate tokens, stop-word removal, lemmatization, and exclusion of duplicate lemmas are applied in sequence. Lemmatization is used in this work instead of stemming because it yields dictionary base forms that are suitable for comparing semantics. As a result of preprocessing, each of the two documents under comparison is represented as a sequence of preprocessed sentences. Afterwards, the fuzzy semantic similarity is calculated between each sentence of the suspicious document and each sentence of the candidate document.
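
The preprocessing sequence for the detailed comparison can be sketched as follows. This is a simplified illustration, not the authors' implementation: sentence segmentation uses a basic regular expression, `STOP_WORDS` is a toy list, and `lemmatize` is a hypothetical placeholder that returns its input unchanged where a real system would call a dictionary-based lemmatizer.

```python
import re

# Toy stop-word list (assumption); a real system would use a fuller list.
STOP_WORDS = {"the", "a", "an", "of", "in", "on", "and", "is", "are", "to"}


def lemmatize(token):
    """Hypothetical placeholder for a dictionary-based lemmatizer."""
    return token


def preprocess_document(text):
    """Return a list of sentences, each a list of distinct lemmas."""
    # Segment into sentences after ., ! or ? followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    seen = set()
    result = []
    for sentence in sentences:
        if len(sentence.split()) <= 3:  # discard sentences of three words or fewer
            continue
        if sentence in seen:  # exclude duplicate sentences
            continue
        seen.add(sentence)
        # Lowercase, then tokenize on alphanumeric runs (drops punctuation).
        tokens = re.findall(r"[a-z0-9]+", sentence.lower())
        tokens = list(dict.fromkeys(tokens))  # remove duplicate tokens, keep order
        tokens = [t for t in tokens if t not in STOP_WORDS]
        lemmas = list(dict.fromkeys(lemmatize(t) for t in tokens))
        result.append(lemmas)
    return result


text = ("The cat sat on the mat today. Too short. "
        "The cat sat on the mat today. Dogs are running in the big park!")
processed = preprocess_document(text)
```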
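The fuzzy similarity between two words $w_1$ and $w_2$ can be computed as in Eq. (2) [11]:

$$F(w_1, w_2) = \begin{cases} 1 & \text{if } w_1 = w_2 \\ 0.5 & \text{if } w_2 \text{ is a synonym of } w_1 \\ 0 & \text{otherwise} \end{cases} \quad (2)$$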
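To obtain the degree of similarity between two sentences $(s_1, s_2)$, a term-to-sentence correlation factor between each term $w_1$ in $s_1$ and the sentence $s_2$ is computed as in Eq. (3):

$$\mu_{w_1, s_2} = 1 - \prod_{w_2 \in s_2} \left(1 - F(w_1, w_2)\right) \quad (3)$$

where $w_2$ ranges over the words in $s_2$ and $F(w_1, w_2)$ is the fuzzy similarity between $w_1$ and $w_2$. Using the correlation factor $\mu_{w_1, s_2}$ of every word $w_1$ in the sentence $s_1$, computed against the sentence $s_2$, the similarity between $s_1$ and $s_2$ is defined as in Eq. (4):

$$Sim(s_1, s_2) = \frac{1}{n} \sum_{w_1 \in s_1} \mu_{w_1, s_2} \quad (4)$$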
where $n$ is the total number of words in $s_1$. Because $Sim(s_1, s_2)$ and $Sim(s_2, s_1)$ can differ when the two sentences have an unequal number of words, the minimum of the two similarity scores is used, as in Eq. (5):

$$\min\left(Sim(s_1, s_2),\, Sim(s_2, s_1)\right) \ge p \quad (5)$$

where $p$ is a permission threshold value, which is the minimal similarity required between the pair of sentences $s_1$ and $s_2$.
Finally, a suspicious sentence whose similarity score against a candidate sentence is equal to or exceeds the threshold value for fuzzy semantic-based similarity is considered a plagiarized sentence. The sentence pair is then marked as plagiarized and added to an output list, together with the corresponding suspicious and candidate documents; the list is sorted in descending order of similarity score.
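
The sentence-level computation of Eqs. (2) to (5) can be sketched in Python as below. Several points are assumptions rather than details confirmed by the text: the word-level scores (1 for identical words, 0.5 for synonyms, 0 otherwise) follow the formulation commonly attributed to [11], the product form of the term-to-sentence correlation is likewise assumed, the `SYNONYMS` table is a toy stand-in for a WordNet lookup, and the default threshold of 0.65 is purely illustrative.

```python
# Toy synonym table standing in for a WordNet lookup (assumption).
SYNONYMS = {
    "car": {"automobile"},
    "automobile": {"car"},
    "quick": {"fast"},
    "fast": {"quick"},
}


def word_similarity(w1, w2, synonyms=SYNONYMS):
    """Fuzzy word similarity: 1 if equal, 0.5 if synonyms, else 0 (assumed values)."""
    if w1 == w2:
        return 1.0
    if w2 in synonyms.get(w1, ()):
        return 0.5
    return 0.0


def correlation(w1, s2):
    """Term-to-sentence correlation: 1 - prod over w2 in s2 of (1 - F(w1, w2))."""
    product = 1.0
    for w2 in s2:
        product *= 1.0 - word_similarity(w1, w2)
    return 1.0 - product


def sentence_similarity(s1, s2):
    """Average correlation of the words of s1 against s2."""
    if not s1:
        return 0.0
    return sum(correlation(w1, s2) for w1 in s1) / len(s1)


def is_plagiarized(s1, s2, threshold=0.65):
    """Compare the smaller of the two directional scores with the threshold."""
    score = min(sentence_similarity(s1, s2), sentence_similarity(s2, s1))
    return score >= threshold


s1 = ["quick", "car", "road"]
s2 = ["fast", "automobile", "road", "night"]
forward = sentence_similarity(s1, s2)
backward = sentence_similarity(s2, s1)
```

The two directional scores differ here because the sentences have different lengths, which is why the minimum of the two is compared with the threshold.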

Post processing stage
Given the list of sentence pairs produced by the detailed comparison stage, together with the suspicious and candidate documents that contain them, successive sentences that are within a given distance of each other are merged to form the plagiarized passages. A distance of 100 characters is used in the proposed work. Finally, the plagiarized passage and the source passage from the document that has been verified to be the source of plagiarism are presented to the user as a pair of passages, together with the documents that contain them.
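
A minimal sketch of the merging step is given below. It is a simplification under stated assumptions: detections are represented as `(start, end)` character offsets within a single document, whereas the described system merges pairs of suspicious and source passages; `merge_detections` and the sample offsets are hypothetical, and only the 100-character distance comes from the text.

```python
def merge_detections(spans, max_gap=100):
    """Merge (start, end) character spans whose gap is at most max_gap."""
    merged = []
    for start, end in sorted(spans):
        if merged and start - merged[-1][1] <= max_gap:
            # Close enough to the previous passage: extend it.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


# The first two spans are 70 characters apart and are merged;
# the third starts 200 characters after the merged passage and stays separate.
passages = merge_detections([(0, 50), (120, 200), (400, 450)])
```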

The proposed method
A new retrieval model is proposed in this paper. In this model, the suspicious document and all source documents in the collection are segmented into individual sentences, and duplicate sentences and sentences of three words or fewer are removed. After that, the documents are preprocessed following all the steps applied to the first (existing) model. Lemmatization is used instead of stemming for generating lemmas. For semantic-based analysis, WordNet v3.0, stored in MySQL, is queried to extract word synonyms.
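
The text does not fully specify how the average term weights are computed. One possible reading, suggested by the later mention of measuring similarity between the "centers" of documents, is that each sentence is turned into a weighted term vector and the document is represented by the mean of its sentence vectors. The sketch below illustrates only that reading; `centroid_vector`, the `idf` table, and the sample data are hypothetical.

```python
from collections import Counter, defaultdict


def centroid_vector(sentences, idf):
    """Average the per-sentence term weights of a document.

    sentences: list of lists of lemmas; idf: dict of term -> weight.
    Each sentence vector uses tf * idf (assumed weighting); the document
    vector is the mean of the sentence vectors.
    """
    totals = defaultdict(float)
    for sentence in sentences:
        for term, tf in Counter(sentence).items():
            totals[term] += tf * idf.get(term, 0.0)
    n_sentences = len(sentences)
    if n_sentences == 0:
        return {}
    return {term: weight / n_sentences for term, weight in totals.items()}


idf = {"cat": 1.0, "sat": 2.0, "mat": 2.0}
doc_vector = centroid_vector([["cat", "sat"], ["cat", "mat"]], idf)
```

Under this reading, the resulting vector would serve as the query and be compared with the centroid vectors of the source documents using the same similarity measure as in the existing model.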

Evaluation metrics
A plagiarism detection system is typically assessed using the standard evaluation metrics Precision, Recall, and F-measure. Furthermore, in the context of the PAN competitions, the Granularity and Plagdet metrics have been proposed. Granularity measures the accuracy of a method at discovering the right segmentation of plagiarism cases, whereas Plagdet is the overall score that combines F-measure and Granularity [14]. For further explanation, let $d$ be a plagiarized document; $d$ defines a sequence of characters, each of which is considered plagiarized or non-plagiarized. A plagiarized section $s$ forms a contiguous sequence of plagiarized characters in $d$. The set of all plagiarized sections in $d$ is denoted by $S$, and the set of all sections found by a plagiarism detection algorithm is denoted by $R$. If the characters in $d$ are considered the basic retrieval units, precision and recall for a given pair $(S, R)$ are computed as in Eqs. (6) and (7), respectively:

$$prec(S, R) = \frac{1}{|R|} \sum_{r \in R} \frac{\left|\bigcup_{s \in S} (s \sqcap r)\right|}{|r|} \quad (6)$$

$$rec(S, R) = \frac{1}{|S|} \sum_{s \in S} \frac{\left|\bigcup_{r \in R} (s \sqcap r)\right|}{|s|} \quad (7)$$

where $s \sqcap r$ computes the positionally overlapping characters of $s$ and $r$.
The two measures are sometimes combined in the F-measure to provide a single measurement for a system, calculated as in Eq. (8):

$$F = \frac{2 \cdot prec(S, R) \cdot rec(S, R)}{prec(S, R) + rec(S, R)} \quad (8)$$

In addition to Precision and Recall, another evaluation metric, Granularity, is used for evaluating plagiarism detection systems. It is defined as the average number of detections reported for each detected plagiarized section, as illustrated in Eq. (9):

$$gran(S, R) = \frac{1}{|S_R|} \sum_{s \in S_R} |R_s| \quad (9)$$

where $S_R \subseteq S$ are the cases detected by detections in $R$, and $R_s \subseteq R$ are the detections of a given $s$. The domain of granularity is $[1, |R|]$. The minimum and ideal granularity value is 1, and $|R|$ indicates the worst case. The measures are combined into a single score, Plagdet, to produce a unique ranking among detection methods, as in Eq. (10):

$$plagdet(S, R) = \frac{F}{\log_2\left(1 + gran(S, R)\right)} \quad (10)$$

For the candidate retrieval stage, the evaluation metrics commonly used in the IR field, Precision and Recall, were used to evaluate the proposed method. Recall is the number of relevant documents retrieved by an algorithm divided by the total number of existing relevant documents, while Precision is the number of relevant documents retrieved by an algorithm divided by the total number of documents retrieved by that algorithm.
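
To make the character-level measures concrete, the following sketch computes Precision, Recall, F-measure, Granularity, and Plagdet for one suspicious document. It is a simplified illustration under stated assumptions: each section is a non-empty set of character offsets in the suspicious document only, a detection is taken to detect a case whenever the two overlap, and granularity is set to 1 when nothing is detected. The function name `pan_measures` and the sample data are hypothetical.

```python
import math


def pan_measures(cases, detections):
    """Character-level measures for one document.

    cases, detections: lists of non-empty sets of character offsets.
    Returns (precision, recall, f_measure, granularity, plagdet).
    """
    all_cases = set().union(*cases)
    all_detections = set().union(*detections)
    # Precision: average fraction of each detection that is truly plagiarized.
    precision = (sum(len(r & all_cases) / len(r) for r in detections) / len(detections)
                 if detections else 0.0)
    # Recall: average fraction of each plagiarized case that was detected.
    recall = (sum(len(s & all_detections) / len(s) for s in cases) / len(cases)
              if cases else 0.0)
    # Granularity: average number of detections per detected case.
    per_case = [sum(1 for r in detections if r & s) for s in cases]
    detected = [count for count in per_case if count > 0]
    granularity = sum(detected) / len(detected) if detected else 1.0
    f_measure = (2 * precision * recall / (precision + recall)
                 if precision + recall else 0.0)
    plagdet = f_measure / math.log2(1 + granularity)
    return precision, recall, f_measure, granularity, plagdet


# One 100-character case; two detections cover parts of it, one is spurious.
cases = [set(range(0, 100))]
detections = [set(range(0, 40)), set(range(50, 90)), set(range(200, 220))]
precision, recall, f_measure, granularity, plagdet = pan_measures(cases, detections)
```

Because the single case is reported as two separate detections, the granularity is above its ideal value of 1, and Plagdet is lower than the F-measure.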

Parameters setting and Performance evaluation
This section focuses on optimizing the parameters of the proposed candidate retrieval model. More specifically, the parameters were optimized by running the proposed model on the training dataset and selecting the values with which it performs well; these values were then used for evaluating the proposed model on the testing dataset. Next, the proposed model was evaluated using the evaluation metrics, and its results were compared against those of the existing model. To find a suitable number of candidates to pass to the next stage, experiments were carried out with different values of this number for both the existing and the proposed model, and the best-performing value was selected.

Figure 1-Performance evaluation of implementing the existing retrieval model and the proposed candidate retrieval model

Tables 1, 2, and 3 and Figures 1, 2, and 3 present the performance evaluation of the candidate retrieval models. First, the existing model was evaluated; it uses VSM to represent the documents under comparison as vectors whose elements are the weights of their terms under a term-weighting scheme. Second, the proposed model was evaluated; it takes the elements of the vectors to be average term weights instead of term weights, computed with a term-weighting scheme. It is observed that the proposed model performs better than the existing model.
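
The tuning of the number of candidates can be illustrated with a small sweep over cut-off values on training data. This is only a sketch of the general procedure: the ranked list, the set of true sources, the candidate values, and the choice of F-measure as the selection criterion are all assumptions, and the function names are hypothetical.

```python
def retrieval_precision_recall(retrieved, relevant):
    """IR precision and recall for one query."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


def sweep_top_k(ranked, relevant, k_values):
    """Evaluate the top-k cut-off for each k; returns {k: (P, R, F)}."""
    results = {}
    for k in k_values:
        p, r = retrieval_precision_recall(ranked[:k], relevant)
        f = 2 * p * r / (p + r) if p + r else 0.0
        results[k] = (p, r, f)
    return results


# Hypothetical ranked retrieval output and true sources for one document.
ranked = ["d3", "d1", "d7", "d2", "d9"]
relevant = {"d1", "d2", "d3"}
results = sweep_top_k(ranked, relevant, k_values=[2, 4, 5])
best_k = max(results, key=lambda k: results[k][2])
```

A small cut-off favors precision and a large one favors recall; the sweep makes that trade-off visible before a value is fixed for the testing data.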

Results analysis and discussion
PAN-PC-10 was used as the evaluation dataset for the proposed model. The corpus includes two types of plagiarism: extrinsic and intrinsic. As the problem statement in this work is restricted to detecting extrinsic plagiarism in English documents, the experiments were performed on the portion dedicated to extrinsic detection, which comprises 70% of the documents in the collection, and on English-language documents only. These documents were randomly separated into training and testing datasets. The training data were used for parameter tuning, whereas the testing data were used for evaluating the performance of the proposed system and comparing it against the existing methods. To evaluate the performance of each of the models, five folds with an equal number of plagiarism case types (high obfuscation, low obfuscation, and no obfuscation) were evaluated and their average was taken. The models for the candidate retrieval problem were evaluated using Precision and Recall as evaluation metrics. The overall performance of the proposed system was assessed using the five standard measures Precision, Recall, F-measure, Granularity, and Plagdet. Experimental results show that the proposed model detected about 80% of the plagiarism cases and that about 90% of the detections were correct. One reason for the low recall recorded in [11] is the use of stems instead of lemmas, which the proposed system overcomes by using lemmatization. The other factor taken into consideration is the improvement of the candidate retrieval stage, which in turn improves the recall of the detection stage.

Conclusions and future works
Based on the commonly used VSM retrieval model, a model for retrieving the candidates required by the detailed comparison stage has been proposed. The proposed retrieval model represents documents as vectors of the average weights of their terms, instead of term weights, and then measures the similarity between the centers of the documents; it has improved both retrieval performance and the overall performance of the plagiarism detection system. Experimental results demonstrate that the proposed model is able to capture the relevant documents and pass them on as candidates for the detailed comparison. They show that the proposed method detected about 80% of the plagiarism cases and that about 90% of the detections were correct. The proposed model is able to detect literal plagiarism as well as cases containing paraphrasing. Performance comparison shows that the proposed system either outperforms or is comparable with other baseline systems. As future work, we aim to improve the performance of the