Plagiarism Detection Methods and Tools: An Overview

Plagiarism detection systems play an important role in revealing instances of plagiarism, especially in the educational sector, where scientific documents and papers are concerned. Plagiarism occurs when content is copied without permission from, or citation of, the author. Detecting it requires extensive knowledge of the forms and classes of plagiarism. Thanks to the tools and methods that have been developed, many types of plagiarism can now be revealed. The development of Information and Communication Technologies (ICT) and the availability of online scientific documents have made these documents easy to access. Combined with the availability of many text-editing programs, this has made plagiarism detection a critical issue. A large number of scientific papers have investigated plagiarism detection, and common datasets are used for recognition systems; the WordNet and PAN datasets have been used since 2009. Researchers have defined verbatim plagiarism as a simple copy-and-paste operation. They have then shed light on intelligent plagiarism, which is more difficult to reveal because it may involve manipulation of the original text, adoption of other researchers' ideas, and translation into other languages. Other researchers have noted that plagiarism may disguise a scientific text by replacing, removing, or inserting words, along with shuffling or modifying the original text. This paper gives an overall definition of plagiarism and reviews different papers on the best-known types of plagiarism, detection methods, and tools.


Introduction
Owing to the rapid advancement of computer and network technologies, such as the Internet, which enables anyone to access online content anytime and from anywhere, academic integrity is becoming a highly sensitive issue, especially among universities and research institutions. Plagiarism is defined as a kind of academically dishonest behavior that damages academic integrity [1], and it must therefore be resisted firmly. However, plagiarism is not only an academic issue; it extends to almost all industries. Occasionally, plagiarism occurs accidentally, but most of the time it is the outcome of a conscious process [2]. The best definition of plagiarism might be "the unacknowledged copying of documents or programs" [3]. To overcome the problem, a large number of researchers have worked over the past decades on software methods for detecting plagiarism [4,5]. Plagiarism was originally detected manually, by recognizing a resemblance to previously consulted content. Today, the great number of available online documents makes manual detection harder. Therefore, there is an urgent need for automatic plagiarism detectors [5]. There are two main types of plagiarism, namely verbatim (literal) and intelligent plagiarism. Plagiarism detection methods are likewise classified into internal detection, where a document is analyzed on its own, and external detection, where the document is compared against a collection of documents. Verbatim plagiarism is the exact copying of the source content without altering it. In intelligent plagiarism, by contrast, the content is altered in various ways. Intelligent plagiarism is more difficult to reveal and includes adoption of ideas, translation into another language, and text manipulation [6,7,8].
This overview paper sheds light on plagiarism and its detection. Section two reviews the plagiarism detection process; section three explains plagiarism classification and detection methods in detail; section four reviews plagiarism detection tools; section five describes the datasets used in plagiarism detection; section six discusses the reviewed works; and section seven concludes the paper.

Plagiarism Process
Designing a robust and reliable plagiarism detection process requires four main stages, described below [8,9]:
1. Collecting the content: In the first stage, the plagiarism detector collects the required content from the users through a search engine, which acts as the interface between the users and the detector.
2. Analyzing the similarities: After collecting the content (scientific papers, assignments, and other soft copies), the detector runs an analysis method to search for similarities among the documents and identify the original copy.
3. Confirming the copy: After the analysis stage, a plagiarism confirmation function is required to separate the plagiarized text from the original. Sometimes, the degree of plagiarism is also determined at this stage.
4. Investigation: The final stage depends heavily on the intervention of the user whenever plagiarism is confirmed. It also relies on the user's expertise to distinguish genuinely plagiarized documents from properly cited ones.

Plagiarism Classification
Plagiarism can be divided into two basic categories: monolingual and cross-lingual. Monolingual plagiarism, which most detectors handle, occurs within a single language, as in the English-to-English setting. Cross-lingual plagiarism occurs across different languages, for example the English-to-Chinese setting, and this type is quite rare [5,10,11]. The next section discusses plagiarism types in detail.

Plagiarism Types
Plagiarism appears in different works, documents, scientific papers, and research articles. It can take the following forms [12]: (i) presenting others' work as your own, (ii) copying others' work without credit or citation, (iii) claiming someone's contribution as your own, whether or not a citation is given, (iv) passing off others' work as yours by restructuring it, and (v) adding misleading acknowledgments that present others' work as your own. Textual plagiarism and source-code plagiarism are the two main types, and they are reviewed below [13,14].

Textual Plagiarism
In research and scientific fields, this type of plagiarism is the most common one: an entire text or document is taken without referring to the author or marking it as a quotation. It can be further divided into seven sub-classes [13,14]:
1. Copy-paste plagiarism: copying the original text, without any acknowledgment of the authors or the original paper, as if it were your own work.
2. Paraphrasing plagiarism: this falls into two categories: (i) simple paraphrasing, where the original text is presented differently by replacing words with similar ones that have the same meaning, and (ii) mosaic/hybrid/patchwork paraphrasing, where the text combines contributions from different papers and presents them differently without citing the original works.
3. Metaphor plagiarism: presenting others' ideas in a better way.
4. Idea plagiarism: taking the entire solution and ideas from others and claiming that the result is an original research paper.
5. Recycled plagiarism: the authors reuse their own previous works and papers for a new publication.
6. 404 Error / Illegitimate Source plagiarism: the citation of the work is invalid.
7. Re-tweet plagiarism: the citation is given, but there is no difference between the original work and the author's work in structure, grammar, and wording.
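As a minimal sketch of how copy-paste plagiarism can be flagged, the Python snippet below measures how many word trigrams of a suspicious text also occur in a source text. The function names, the trigram size, and the sample sentences are assumptions made for illustration; they are not taken from any of the reviewed systems.

```python
import re

def word_ngrams(text, n=3):
    """Return the set of word n-grams (as tuples) found in lowercased text."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def containment(suspicious, source, n=3):
    """Fraction of the suspicious text's n-grams that also occur in the source."""
    susp = word_ngrams(suspicious, n)
    if not susp:
        return 0.0
    return len(susp & word_ngrams(source, n)) / len(susp)

# Hypothetical sample texts.
source = "Plagiarism is the unacknowledged copying of documents or programs."
copied = "Plagiarism is the unacknowledged copying of documents or programs."
reworded = "Plagiarism means reusing papers or software without giving credit."

print(containment(copied, source))    # 1.0: every trigram is shared
print(containment(reworded, source))  # 0.0: no trigram is shared
```

The second result shows why such surface matching is not enough for paraphrasing or idea plagiarism: the reworded sentence shares no trigram with the source even though it conveys the same idea.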

Source-Code Plagiarism
This type typically appears in educational fields, where program code originally written by someone is copied, adjusted, or reused by others, partially or completely. It has the following four sub-classes [13,14]:
1. Manipulation plagiarism: the source code is altered by other developers, who delete code from or insert code into the original without citation or acknowledgment.
2. Reordering structure plagiarism: the structure of the source code is modified by reordering functions or statements without referring to the original work.
3. No-change plagiarism: the developers change nothing in the code except adding or removing spaces or comments, and present it as their own work.
4. Language switching plagiarism: the source code is rewritten in another programming language and declared to be original code.
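No-change plagiarism in particular lends itself to a short illustration. The sketch below, which assumes the submissions are Python programs, uses the standard `tokenize` module to discard comments and layout before comparing two files. The helper name `normalized_tokens` and the sample programs are hypothetical, and this is not how any of the tools reviewed later are known to work.

```python
import io
import tokenize

def normalized_tokens(source_code):
    """Token strings of Python code with comments and layout differences removed."""
    ignored = {tokenize.COMMENT, tokenize.NL, tokenize.NEWLINE, tokenize.ENDMARKER}
    result = []
    for tok in tokenize.generate_tokens(io.StringIO(source_code).readline):
        if tok.type in ignored:
            continue
        if tok.type == tokenize.INDENT:
            result.append("<INDENT>")   # hide the actual indentation width
        elif tok.type == tokenize.DEDENT:
            result.append("<DEDENT>")
        else:
            result.append(tok.string)
    return result

# Hypothetical submissions: the second only adds comments and changes spacing.
original = "def area(w, h):\n    return w * h\n"
disguised = "# my own work\ndef area( w,h ):\n\n        return w*h   # compute\n"
different = "def area(w, h):\n    return w + h\n"

print(normalized_tokens(original) == normalized_tokens(disguised))
print(normalized_tokens(original) == normalized_tokens(different))
```

Manipulation, reordering, and language switching would survive this normalization, so they need comparisons that look at program structure rather than at the raw token sequence.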

Plagiarism Detection
Many papers have sought highly accurate plagiarism detection methods using different tools, but finding a perfect one has always been challenging because of the rapid development of technologies, software, and data mining tools. This development has become a double-edged sword: the methods of plagiarism have evolved, while the methods of detecting it have developed in response to curb the illegal copying of researchers' original work [10,11]. Plagiarism detection can be performed manually or through an automated process. The automated process closely resembles natural language processing, visual identification, and biometric processes, all of which are founded on pattern recognition. Because the automated process does not achieve 100% accuracy, manual checking is still needed.

Internal Plagiarism Detection
This type, also called intrinsic plagiarism detection, involves finding plagiarized passages within a document without access to the potential original text.
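Because no source document is available, intrinsic approaches usually look for passages whose writing style departs from the rest of the document. The sketch below is a deliberately simple illustration of that idea: it scores each paragraph by average word length and flags paragraphs that lie far from the document mean. The single feature, the threshold of 1.5 standard deviations, and the sample paragraphs are assumptions; practical stylometric detectors combine many more features.

```python
import re
import statistics

def avg_word_length(text):
    """Average number of letters per word; 0.0 for text without words."""
    words = re.findall(r"[A-Za-z]+", text)
    return sum(len(w) for w in words) / len(words) if words else 0.0

def style_outliers(paragraphs, threshold=1.5):
    """Indices of paragraphs whose style score deviates strongly from the mean."""
    scores = [avg_word_length(p) for p in paragraphs]
    mean = statistics.mean(scores)
    spread = statistics.pstdev(scores)
    if spread == 0:
        return []
    return [i for i, s in enumerate(scores) if abs(s - mean) / spread > threshold]

# Hypothetical document: the third paragraph is written in a very different style.
paragraphs = [
    "The cat sat on the mat.",
    "We went to the shop and got some milk.",
    "Epistemological considerations notwithstanding, multidimensional "
    "characterization remains indispensable.",
    "It was a good day and the sun was out.",
    "She read a book by the fire all night.",
]

print(style_outliers(paragraphs))  # [2]
```

A flagged paragraph is only a candidate for review; a legitimate quotation or a change of topic can shift the style as well.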

External Plagiarism Detection
External plagiarism detection involves comparing suspicious documents against potential original documents [10,15]. The next sections present different methods and tools for plagiarism detection.
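A minimal sketch of the external setting follows: a suspicious document is compared against every document in a small collection, and the candidates are ranked by the cosine similarity of their term-frequency vectors. The function names, the tiny corpus, and the choice of cosine similarity are assumptions made for illustration, not a description of a specific system from the literature.

```python
import math
import re
from collections import Counter

def term_vector(text):
    """Bag-of-words term frequencies of lowercased text."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine_similarity(a, b):
    """Cosine of the angle between two term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def rank_sources(suspicious, corpus):
    """Return (name, similarity) pairs, most similar candidate source first."""
    susp = term_vector(suspicious)
    scored = [(name, cosine_similarity(susp, term_vector(text)))
              for name, text in corpus.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Hypothetical collection of potential source documents.
corpus = {
    "doc_a": "Automatic plagiarism detectors compare documents against a collection.",
    "doc_b": "The weather in spring is mild and pleasant.",
}
suspicious = "Plagiarism detectors compare documents against a large collection."

for name, score in rank_sources(suspicious, corpus):
    print(name, round(score, 3))
```

In this example `doc_a` ranks first because it shares seven of its eight terms with the suspicious text, while `doc_b` shares none.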

Plagiarism Detection Methods
Researchers have implemented many methods to counter plagiarism, as it has grown into a serious issue in the academic community [5]. A comparative study of plagiarism detection methods, based on the cited sources, is given in Table 1.
Revealing an act of plagiarism is quite challenging because plagiarism takes many forms, which has led to approaches such as cross-lingual syntax-based and/or dictionary-based methods. Such detection requires extensive in-depth knowledge of multiple languages in more than one document [36,37].
9. Grammar Semantics Hybrid Plagiarism Detection Method: A very effective and extensively implemented approach in the field of natural language processing (NLP). It is very accurate in revealing copy-paste and paraphrasing plagiarism, and it remedies the limitations of the semantic-based method [38].
10. Classification and Cluster-Based Methods: These methods are very helpful for retrieving information when searching plagiarized documents. They also reduce the comparison time during detection relative to other methods [39].
11. Citation-Based Method: This is a novel method that belongs to the semantic plagiarism detection methods because it uses the semantics contained in the citations of a document. It identifies similar pairs of documents based on their citations [40,41].
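To make the citation-based idea concrete, the sketch below compares two documents through their citations alone: once by the overlap of their cited references and once by the longest run of citations that appear in the same order. The reference keys are hypothetical, and both measures are simplified stand-ins chosen for illustration rather than the algorithms of [40,41].

```python
def citation_overlap(refs_a, refs_b):
    """Jaccard overlap between the sets of references cited by two documents."""
    a, b = set(refs_a), set(refs_b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def longest_common_citation_sequence(seq_a, seq_b):
    """Length of the longest common subsequence of two in-text citation orders."""
    prev = [0] * (len(seq_b) + 1)
    for x in seq_a:
        curr = [0]
        for j, y in enumerate(seq_b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]

# Hypothetical citation sequences, in order of appearance in each document.
doc_a = ["smith2010", "lee2012", "khan2015", "wu2018"]
doc_b = ["smith2010", "lee2012", "park2016", "wu2018"]

print(citation_overlap(doc_a, doc_b))                  # 0.6 (3 shared of 5 distinct)
print(longest_common_citation_sequence(doc_a, doc_b))  # 3
```

Because citations tend to survive paraphrasing and translation, measures of this kind can point to related documents even when the wording has been changed.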

Plagiarism Detection Tools
A large number of tools have been developed and used to detect plagiarism [5]. Table 3 lists plagiarism detection tools with their pros and cons, covering the period from 1994 to 2020 [5].

Tool | Year | Characteristics
[...] | [...] | It takes parts of the code at a time as input and produces HTML pages as output to analyze the similarities between a pair of documents.
iThenticate [43] | 1996 | A text-document plagiarism detection tool delivered as a web page. It compares a number of documents with the original one without requiring installation on the end-user's computer, but it is limited to 25,000 words per submission.
JPlag [32] | 1997 | Like the previous tools, this is an online source-code plagiarism tool. It takes a number of programs and identifies the identical lines among them. It works with the C, C++, and Java programming languages and needs less than one minute to check hundreds of code lines.
GPSP - Glatt Plagiarism Screening Program [44] | 1999 | Unlike the previous tools, it works offline. It combines different approaches and looks for similarities among the writing styles of different authors. It reveals plagiarism by making the author take a fill-in-the-blank test, then counts the correctly filled blanks and the time taken to finish the test. Finally, based on the results, it decides whether plagiarism has occurred.
Turnitin [37,42] | 2000 | A web-based tool provided by iParadigms. The user uploads the document online, and the document is then saved to the system's database. After that, the tool checks for plagiarism by creating a document fingerprint. It is used by nearly 15,000 institutions around the world, with more than 30 million users, owing to its flexibility and robustness. Therefore, it is considered the best tool.
Plagiarism Checker [45] | 2006 | A free online tool that uses search engine services to detect student plagiarism by checking whether a document has a similar copy among other online documents.
Plagiarism Scanner [46] | 2008 | An effective tool that searches almost all online resources, such as libraries (Questia and ProQuest), online databases, websites, and search engines. When plagiarism is detected, it produces a full report including the rate, originality, and percentage of plagiarized material.
PlagTracker [47] | 2011 | A well-known tool for all kinds of users (teachers, website owners, and students) that holds a large number of academic resources in its database and produces a detailed report whenever plagiarism is detected.
PlagScan [48] | 2015 | This tool provides multiple services to companies, universities, and schools, but it is not free; users must have a paid account to register.
Exactus Like [49] | 2016 | A web-based tool that works with different formats, such as HTML, Microsoft Word, and Adobe PDF. It detects moderately disguised borrowing (word/phrase reordering, substitution of some words with synonyms) by means of a deep parsing function.
Grammarly [50] | 2016 | A website and mobile application service that lets individuals correct their documents in real time through a user-friendly interface. It works online and therefore requires an internet connection.
Grammarly [50] | 2018 | An evolved version of the previous one, representing the premium type. It targets businesses, such as teams and companies. Users reported that Grammarly helps them work more professionally.
DupliChecker [51,52] | 2020 | This tool is available 24/7 and ready whenever the user needs it. It is one of the most effective free plagiarism tools on the internet. The user needs only a search engine and an internet connection to access it. It lets the user either copy-paste or upload the document to be checked for plagiarism.
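The table above mentions checking for plagiarism by creating a document fingerprint but does not say how such a fingerprint is built. The sketch below shows one common family of techniques, under stated assumptions: the text is reduced to character k-grams, each k-gram is hashed, and the smallest hash in every sliding window is kept (a winnowing-style selection). The parameters `k` and `window`, the use of `zlib.crc32`, and the function names are illustrative choices, not a description of any listed tool.

```python
import re
import zlib

def fingerprint(text, k=5, window=4):
    """Winnowing-style fingerprint: the minimum k-gram hash of each sliding window."""
    cleaned = re.sub(r"[^a-z0-9]", "", text.lower())  # drop case, spaces, punctuation
    hashes = [zlib.crc32(cleaned[i:i + k].encode())
              for i in range(len(cleaned) - k + 1)]
    if not hashes:
        return set()
    if len(hashes) <= window:
        return {min(hashes)}
    return {min(hashes[i:i + window]) for i in range(len(hashes) - window + 1)}

def resemblance(text_a, text_b):
    """Jaccard overlap of two fingerprints, between 0.0 and 1.0."""
    fa, fb = fingerprint(text_a), fingerprint(text_b)
    union = fa | fb
    return len(fa & fb) / len(union) if union else 0.0

# Hypothetical submissions that differ only in case, spacing, and punctuation.
first = "The user uploads a document, and the tool creates a document fingerprint."
second = "the user uploads a document and THE TOOL creates a   document fingerprint"

print(resemblance(first, second))
```

Storing only the selected hashes keeps the fingerprint much smaller than the document, which is what makes comparison against a large database of earlier submissions practical.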

Datasets Used in Plagiarism Detection Systems
Two main datasets are used for plagiarism detection and are illustrated below.

WordNet Dataset
WordNet is a free, publicly available dataset that contains a large English lexicon of nouns, verbs, adverbs, and adjectives. It contains 155,287 words organized in 117,659 synsets, that is, groups of cognitive synonyms. WordNet not only links word forms (strings) but also specifies their meanings. As a result, words that are close to each other in the dataset are semantically disambiguated. WordNet also labels the semantic relationships among words, whereas the groupings of words in a thesaurus follow no explicit pattern other than similarity of meaning. The main relationship represented in this dataset is synonymy between words, as with the words "large" and "wide", which have relatively similar meanings. A word form with several distinct meanings is represented in as many distinct synsets. Therefore, each form-meaning pair in WordNet is unique. The current version of this dataset covers not only English but also other languages, such as Italian and Spanish [53].
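The role of synsets in detecting simple paraphrasing can be sketched as follows. The snippet maps each word to a synset label before comparing two sentences, so that a synonym substitution no longer hides the overlap. The small `TOY_SYNSETS` table and its labels are hypothetical stand-ins for a real WordNet lookup, used here only to keep the example self-contained.

```python
import re

# Hypothetical miniature synonym table; a real system would query WordNet instead.
TOY_SYNSETS = {
    "large": "big.a", "big": "big.a",
    "quick": "fast.a", "fast": "fast.a",
    "car": "car.n", "automobile": "car.n",
}

def to_concepts(text, synsets=TOY_SYNSETS):
    """Replace each word by its synset label when known, else keep the word."""
    return [synsets.get(w, w) for w in re.findall(r"[a-z]+", text.lower())]

def overlap(items_a, items_b):
    """Jaccard overlap between two collections of items."""
    a, b = set(items_a), set(items_b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

original = "The large car is fast"
paraphrase = "The big automobile is quick"

raw = overlap(original.lower().split(), paraphrase.lower().split())
mapped = overlap(to_concepts(original), to_concepts(paraphrase))
print(raw, mapped)  # 0.25 1.0
```

The raw word overlap is low because three of the five words were replaced, whereas the overlap at the synset level is complete.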

Plagiarism analysis, Authorship identification, and Near-duplicate detection (PAN)
Another well-known plagiarism detection dataset is PAN, which stands for plagiarism analysis, authorship identification, and near-duplicate detection. It is a series of scientific events and shared tasks on digital text forensics and stylometry. Every year, an international conference and competition called PAN@CLEF is held to bring together the most advanced work on plagiarism detection techniques [54].

Discussion
This part reviews the most serious and most frequently practiced plagiarism techniques around the world, along with the most common challenges facing the development of effective and robust plagiarism detection systems.

Comparison of Common Plagiarism Types
Detection systems are readily available and are growing rapidly. Therefore, a comparison among current plagiarism techniques is required. This comparison focuses on two studies that show the most serious plagiarism techniques in universities, schools, and the higher education sector. The studies cover several scientific areas, such as medicine, natural sciences, engineering, and social sciences, in 40 different countries around the world over the period from 2013 through 2015. Tables 4 and 5 list the most commonly practiced plagiarism types [55,56].

Challenging Factors Facing Plagiarism Detection
Among the previously published papers on plagiarism, many have surveyed plagiarism types and techniques in depth. Most of the plagiarism detectors available today can do the following: (i) distinguish plagiarism in source code and/or text documents, with or without citation, (ii) extract semantic and/or salient syntactic features, and (iii) detect plagiarism in both cross-lingual and monolingual documents [5]. Despite their availability and efficiency, these detectors are not effective enough to address the open research challenges. In the current technological age, new algorithms are produced to solve many problems, and plagiarism is one that still needs extensive work, especially in the scientific community. Computer science approaches can address these challenges by relying on advances in ICT. Some of the challenges are the following: (i) a proof of the correctness and completeness of scientific works, whether they are text documents or source code, is not yet available; (ii) a highly accurate framework for plagiarism detection that can reveal text segments, for both intrinsic and extrinsic detection, is missing; (iii) developing highly accurate plagiarism checking systems that need no external references is a very challenging task; and (iv) providing a complete repository of scientific works that combines the works of all authors and the references to them in one place is a difficult matter [5].

Conclusions
In this paper, an extensive literature survey of plagiarism types, classification, detection methods, and tools was conducted. Textual plagiarism, with its seven sub-types, and source-code plagiarism, with its four sub-types, were highlighted. Then, plagiarism detection methods were illustrated, and tools developed over more than twenty years were summarized. The newer tools are more advanced; most of them work online through a web page and an internet connection, and some are free while others require a paid subscription. Next, the best-known datasets used in plagiarism detection were reviewed, and a table was prepared to compare the methods adopted in plagiarism detection. We note that each method has strengths and weaknesses that depend on how it is designed to support two important factors: time and accuracy. We also note that there are two mechanisms of action, namely the parallel and the series mechanisms. The parallel mechanism provides higher accuracy in less time because it scans all the contents of a dataset. For this reason, we can say that a good algorithm is one that meets the required conditions in terms of time, accuracy, or both. Finally, the most frequent plagiarism types were discussed extensively, and the most challenging steps in implementing plagiarism detection were investigated.