Using Retrieved Sources for Semantic and Lexical Plagiarism Detection

Plagiarism is described as using someone else's ideas or work without their permission. Using lexical and semantic text similarity notions, this paper presents a plagiarism detection system for examining suspicious texts against available sources on the Web. The user can upload suspicious files in pdf or docx formats. The system will search three popular search engines for the source text (Google, Bing, and Yahoo) and try to identify the top five results for each search engine on the first retrieved page. The corpus is made up of the downloaded files and scraped web page text of the search engines' results. The corpus text and suspicious documents will then be encoded as vectors. For lexical plagiarism detection, the system will leverage Jaccard similarity and Term Frequency-Inverse Document Frequency (TFIDF) techniques, while for semantic plagiarism detection, Doc2Vec and Sentence Bidirectional Encoder Representations from Transformers (SBERT) intelligent text representation models will be used. Following that, the system compares the suspicious text to the corpus text. Finally, a generated plagiarism report will show the total plagiarism ratio, the plagiarism ratio from each source, and other details.

2. Combine Jaccard and TFIDF with cosine similarity to detect lexical plagiarism, and use a hybrid of Doc2Vec and SBERT to detect semantic plagiarism.
The focus of this paper is on external plagiarism in monolingual English texts. The paper is organized as follows: related studies on plagiarism detection are presented in Section 2. The adopted approaches and methodologies are described in Section 3, while the proposed system is presented in Section 4. The experimental results are explained in Section 5, and Section 6 presents the discussion. Finally, Section 7 presents the conclusions.

Literature Review
Plagiarism threatens the growth and prosperity of academic communities, especially since the advent of the Web, which is why efforts must be made to prevent it. Several plagiarism detection (PD) approaches have been implemented; some of them are reviewed in Table 1.

Implemented a Google API-based PD system named "Spotting and Neutralizing Internet Theft by Cheaters" (SNITCH). Each text is looked up on the internet. The system produces a concise report in the form of an annotated HTML page, which includes statistics on the ratio of plagiarism and the time required to complete the verification.
It supports only the analysis of text documents, and the concise report is available only as an annotated HTML page.

[6]
They used the TF-IDF method to represent text numerically in order to detect plagiarism. The accuracy of the assignment was determined by counting the number of correctly assigned documents and dividing it by the total number of documents.
They did not report any results for the implementation of the system.

[14]
The web-based PD system is made up of two main modules: one global component based on heuristics; the searches are subsequently sent to Google via its API. A similarity report is created for the suspicious document, in which the plagiarized sections are highlighted in different colors to show where they came from.
It uses only the Google search engine, and Google limits free subscription accounts to a maximum of 100 queries per day.

[15]
Described a web-based anti-plagiarism technique at the academic level. The Google search API is used to look up specific keywords or key phrases from a given Web content. The tool presents its findings in the form of URLs.
Only the Google search engine is used. The only accepted text file formats are .txt and .doc.

[16]
Used the k-Nearest Neighbor (k-NN) machine learning (ML) algorithm to compare a set of texts with multiple existing files in order to determine the copied parts.
The plagiarism result for each source file is displayed only as one of two labels (plagiarized, not plagiarized) and not as a detailed report.

[17]
Proposed a web tool for verifying multilingual texts (English and Arabic). The program searches the internet for duplicate material using three common web search engines: Yandex SERP, Bing, and Google. They used TFIDF and the cosine text similarity approach. For each suspicious sentence, the HTML report indicates whether the sentence is plagiarized or not.
The report only indicates, for each suspicious sentence, whether it is plagiarized or not, so plagiarism detection is performed only at the sentence level.
The previously reviewed papers did not consider the semantic approach to plagiarism detection and did not generate a colored PDF plagiarism detection report. Furthermore, they used only some sentences directly or saved them as text files. To overcome these limitations, the purpose of this paper is to design an online source retrieval-based system for plagiarism detection.
This system uses the Application Programming Interfaces (APIs) of the Google, Yahoo, and Bing search engines to exploit the specific advantages of each one, and it can examine the suspicious document to detect plagiarism both lexically and semantically. The suspicious document may be a pdf or docx file. The search results are either web pages, whose text contents are scraped, or pdf or docx files, which are downloaded. The system also applies quote removal, which eliminates sentences that appear inside quotation marks: in documents such as research papers, journal papers, and articles, text presented inside quotation marks is not considered plagiarized. Furthermore, cited sentences are ignored when detecting plagiarism.

Methodology
The corpus, text processing techniques, and plagiarism detection algorithms employed in our tests are described in this section.

The corpus
The corpus is used in plagiarism detection projects to measure the similarity of suspicious documents and calculate the plagiarism percentage [18]. The corpus can be a pre-existing offline dataset to which the experiments are applied, or it can be on-line sources on the World Wide Web [19,20].
Since we do not have a huge corpus like those of common plagiarism detection platforms such as Turnitin and iThenticate, this paper proposes building a dynamic corpus by scraping the related HTML web pages, extracting their text by parsing those pages, and collecting further text data by downloading freely available source documents, such as pdf or docx files, from the web.

Text Representation
A text representation or word embedding is a function that maps a word or a sentence to a low-dimensional vector, with the distance between vectors indicating how similar the corresponding words and sentences are [21].
Many classical and intelligent PD models require text to be represented, or encoded, as numeric vectors. Some of the text representation approaches are described in the next paragraphs [22]. Recent advances in neural networks have made it simple to create a distributed representation that accurately reflects word similarity from real-world data [23].

Term Frequency-Inverse Document Frequency (TFIDF)
TFIDF is a text weighting technique that is frequently used in conjunction with cosine similarity to measure the similarity of two texts [24].
The TFIDF algorithm considers the frequency of the various terms across all documents and is capable of distinguishing between them. TF stands for Term Frequency, and IDF stands for Inverse Document Frequency [7,25]. The weight of a term in a single document is calculated as follows [25]:

$$W_{t,d} = TF_{t,d} \times IDF_t \tag{1}$$

where $W_{t,d}$ is the weight of term t in a single document d, $TF_{t,d}$ is the frequency with which the term t appears in the document d, and $IDF_t$ is the inverse document frequency, as calculated by Eq. 2 [25]:

$$IDF_t = \log\left(\frac{N}{DF_t}\right) \tag{2}$$

where N is the total number of documents and $DF_t$ is the total number of documents that contain the term t. IDF reflects how the term t is spread across the documents of the collection, while TF represents how often the term occurs within a single document. TF-IDF is an effective approach for calculating term weights because it reduces the weight of high-frequency terms that do little to distinguish one document from another [7,26].
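As an illustration of the weighting described above, the following minimal sketch computes the weight of each term as its frequency in the document multiplied by its inverse document frequency. It assumes a raw term count for TF and a natural logarithm for IDF; the function name `tfidf_weights` and the toy documents are hypothetical, and library implementations often use smoothed variants that give different numbers.

```python
import math
from collections import Counter


def tfidf_weights(documents):
    """Return one {term: weight} dict per tokenized document (weight = TF * IDF)."""
    n_docs = len(documents)
    # DF_t: number of documents that contain term t at least once
    doc_freq = Counter(term for doc in documents for term in set(doc))
    weights = []
    for doc in documents:
        term_freq = Counter(doc)  # TF_{t,d}: raw count of t in d (assumption)
        weights.append({
            term: tf * math.log(n_docs / doc_freq[term])  # natural log (assumption)
            for term, tf in term_freq.items()
        })
    return weights


# Toy corpus, invented for illustration only
docs = [
    "plagiarism detection uses text similarity".split(),
    "text similarity compares two documents".split(),
    "search engines retrieve source documents".split(),
]
weights = tfidf_weights(docs)
```

In this toy corpus, "plagiarism" occurs in one document only and therefore receives a higher weight than "text", which occurs in two of the three documents.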

Document to Vectors (Doc2Vec)
Word embedding is a kind of word representation as a numeric vector. It uses a low-dimensional vector to hold contextual information. Doc2Vec is an unsupervised ML technique for generating vector representations of phrases, paragraphs, and documents. It does this in a simple way: it treats a piece of text as a particular type of word [2]. In this technique, a text is converted into a vector that reflects the degree of significance of a specific word in the text. Because the Doc2Vec model retains the context of the words encountered, the entire document may be represented as a vector according to its semantic meaning [2]. In the model diagrams, $w_{p,n}$ denotes the nth word in passage p, and p denotes a unique identifier of a passage (or document). As a result, one can use the distributed memory (DM) model to add the passage's identity to each context created from it.
Unlike the skip-gram model, which attempts to predict a context for a particular word, the distributed bag-of-words (DBOW) model attempts to predict a context for a particular passage. A context in this case is a word sequence produced from a passage by selecting a text window at random. In the distributed memory setup, the paragraph vector functions as a memory that records which words it was trained with [10].

Sentence Bidirectional Encoder Representations from Transformers (SBERT)
Bidirectional Encoder Representations from Transformers (BERT) is a text representation model made up of many transformer encoder blocks stacked on top of each other. Rather than extracting the semantic meaning of a single word, it considers the entire sentence, utilizing the Deep Learning (DL) paradigm. The BERT learning process is divided into two stages: pre-training and fine-tuning. In pre-training, word tokens in sentences from a huge corpus are randomly masked, and the model learns by predicting the masked tokens. Fine-tuning is the process of further training a pre-trained BERT model with labelled data [27]. Sentence-BERT (SBERT) is a modification of the pre-trained BERT network that uses Siamese and triplet network architectures to generate semantically meaningful embeddings for each sentence, so that the generated embeddings can be compared using cosine similarity [28]. SBERT's network structure (Figure 2) depends on the training data supplied. At inference, the cosine similarity between the two sentence embeddings u and v is calculated.
Figure 2: SBERT architecture at inference, for example, to compute similarity scores [28]

Approaches of text similarity
In general, a similarity method receives two texts and outputs the degree of similarity between them. The values produced by the similarity function lie in the range [0,1] [29].
There are numerous measures for calculating similarity in the literature. Two of the best known are Jaccard and cosine similarity [30]. Jaccard similarity is a count-based co-occurrence measure used to determine text similarity. The Jaccard coefficient is the number of elements in the intersection set divided by the number of elements in the union set [31]:

$$J(S, D) = \frac{|S \cap D|}{|S \cup D|} \tag{3}$$

where |S| and |D| denote the word counts of the suspicious and source texts, respectively, and $|S \cap D|$ and $|S \cup D|$ are the counts of words in the intersection and in the union of the suspicious and source texts, respectively.
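The Jaccard coefficient can be sketched in a few lines. The example below is illustrative only: it assumes whitespace tokenization and lowercasing, treats each text as a set of distinct words, and returns 0.0 for two empty texts by convention; the function name `jaccard_similarity` is hypothetical.

```python
def jaccard_similarity(suspicious, source):
    """Jaccard coefficient between the word sets of two texts."""
    s = set(suspicious.lower().split())
    d = set(source.lower().split())
    if not s and not d:
        return 0.0  # convention chosen here for two empty texts (assumption)
    return len(s & d) / len(s | d)


score = jaccard_similarity(
    "the system detects lexical plagiarism",
    "the system detects semantic plagiarism",
)
```

The two sentences share four distinct words out of six in their union, so the coefficient is 4/6.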
Cosine similarity measures the similarity between two vectors by the cosine of the angle between them, and it is commonly used in text mining to compare documents [32]. It is calculated as follows [9]:

$$C(x, y) = \frac{x \cdot y}{\|x\| \, \|y\|} \tag{4}$$

where $x \cdot y$ is the dot product of the vectors x and y, calculated by:

$$x \cdot y = \sum_{i=1}^{n} x_i y_i \tag{5}$$

$\|x\|$ is the length of the vector x, as determined by:

$$\|x\| = \sqrt{\sum_{i=1}^{n} x_i^2} \tag{6}$$

and $\|y\|$ is the length of the vector y, as determined by:

$$\|y\| = \sqrt{\sum_{i=1}^{n} y_i^2} \tag{7}$$

The larger the value of the similarity function, the more similar the two items assessed are, as stated in [17]. Conversely, the lower the value of the similarity function, the more distinct the two items are considered to be. Manhattan and Euclidean distances are two common dissimilarity measures. A dissimilarity measure (d) can be converted into a similarity measure (s) by subtracting d from 1, that is, s = 1 - d [24].
Cosine similarity is often preferred over other metrics because, even if two text documents are far apart in terms of distance, they may still point in a similar direction and therefore be similar in terms of context [33]. In addition, several authors obtained their best results when using cosine similarity.
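A minimal sketch of cosine similarity on plain Python lists is given below. It assumes that both vectors have the same number of components and returns 0.0 when either vector has zero length, which is a convention chosen here; the function name `cosine_similarity` and the example vectors are hypothetical.

```python
import math


def cosine_similarity(x, y):
    """Cosine of the angle between two equal-length numeric vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    if norm_x == 0 or norm_y == 0:
        return 0.0  # convention for zero vectors (assumption)
    return dot / (norm_x * norm_y)


# Same direction, different lengths: far apart as points, but maximally similar
same_direction = cosine_similarity([1, 2, 0], [2, 4, 0])
# No shared non-zero component: similarity is zero
orthogonal = cosine_similarity([1, 0], [0, 1])
```

The first pair shows why the measure is insensitive to vector length: the second vector is twice as long as the first, yet the similarity is 1 up to floating-point rounding.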

The proposed system
This paper presents a source retrieval plagiarism detection system. It detects text plagiarism online by utilizing the APIs of three prominent search engines (Google, Yahoo, and Bing) and applying lexical and semantic text similarity concepts. It receives a text document and a plagiarism threshold as input and produces a report that informs the user whether the document is unique or plagiarized lexically, semantically, or both, along with other information such as the source text and the overall percentage of plagiarism. The structure of the presented system is illustrated in Figure 3. The system proceeds in several stages, which are explained in the following sections. Three kinds of experiments are implemented: lexical, semantic, and combined lexical and semantic plagiarism detection. In all three cases, a final similarity report is created for the suspicious text, in which the plagiarized sentences are highlighted in colors that identify their sources and the unique sentences are not highlighted.

Lexical plagiarism detection using Jaccard Similarity - TFIDF with Cosine similarity
After the texts are preprocessed and split into n-words, lexical PD can be carried out in two ways:
1. Jaccard similarity: apply Jaccard similarity between the suspicious and original texts to obtain lexical PD.
2. TFIDF with cosine similarity: represent the texts as TFIDF vectors, then apply cosine similarity to detect whether a sentence is lexically plagiarized or not.
The lexical plagiarism detection process includes several stages, as follows:

Stage 1 (Input Text):
The user uploads a scientific file in pdf or docx format. The user also inputs a plagiarism detection threshold and a number n used to split the texts into sentences of n words each.

Stage 2 (Extract Title and Keywords):
Extract the title and keywords from the uploaded scientific paper file using the Regex and NLTK libraries of Python.

Stage 3 (Submit the query):
This stage queries the APIs of three widely used search engines (Google, Yahoo, and Bing) with the title and keywords extracted in the previous stage. For each search engine, the system attempts to retrieve the top five results: web pages are scraped, and pdf or docx files are downloaded and stored, using the bs4 and request libraries of Python.

Stage 4 (Build Corpus):
Convert the scraped webpages and downloaded documents to text files in order to build the corpus.

Stage 5 (Pre-process using NLP):
The suspicious text and the source texts are pre-processed at this stage using NLP techniques available in the NLTK library of Python, including:
1. Tokenizing the text.
2. Removing stop words, single characters, punctuation, non-alphabetic symbols, and numbers.
3. Converting all capital letters to small letters.
4. Lemmatizing verbs.
5. Splitting the text into sentences of n words each, with n chosen by the user.
Optionally, quoted and cited paragraphs can be removed from the suspicious text using the Regular Expressions (Regex) library in Python.
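The pre-processing steps above can be sketched as follows. This is a simplified stand-in, not the system's implementation: it uses only Python's `re` module rather than NLTK, a tiny illustrative stop-word list, and it omits lemmatization; the names `STOP_WORDS`, `remove_quoted`, and `preprocess` are hypothetical. Lowercasing is applied before filtering so that the stop-word check is case-insensitive.

```python
import re

# Tiny illustrative stop-word list (assumption); NLTK provides a full one
STOP_WORDS = {"a", "an", "the", "is", "are", "as", "of", "in", "on", "and", "or", "to", "it"}


def remove_quoted(text):
    """Drop text enclosed in straight double quotes."""
    return re.sub(r'"[^"]*"', " ", text)


def preprocess(text, n):
    """Return the cleaned tokens of `text`, grouped into chunks of n words."""
    text = remove_quoted(text).lower()
    # Keep alphabetic tokens only: drops punctuation, symbols, and numbers
    tokens = re.findall(r"[a-z]+", text)
    # Remove stop words and single characters (lemmatization is omitted here)
    tokens = [t for t in tokens if len(t) > 1 and t not in STOP_WORDS]
    return [tokens[i:i + n] for i in range(0, len(tokens), n)]


chunks = preprocess(
    'The system, as noted, "ignores quoted text" and detects 2 copied passages in it.',
    n=3,
)
```

With n = 3, the quoted span, the number, and the stop words are dropped, and the five remaining words are grouped into one chunk of three words and one of two.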

Stage 6 (Text representation):
1. Use Jaccard similarity: apply the Jaccard similarity to each pair of suspicious document sentence and source sentence. Let J(S,D) be the Jaccard similarity between suspicious sentence S and source sentence D.
2. Use TFIDF with cosine similarity: calculate the TFIDF text encoding for each sentence of the source documents and of the suspicious document. For each TFIDF vector of a suspicious document sentence, compute the cosine similarity with each of the TFIDF vectors of the source document sentences. Let C(S,D) be the cosine similarity between suspicious sentence S and source sentence D.

Stage 7 (Similarity measure):
The suspicious sentence S is lexically plagiarized if Eq. 8 is satisfied: among the similarity values computed against the source sentences, select the largest value that is greater than threshold1 and mark the sentence as lexical plagiarism. The value of threshold1 must be entered by the user to detect lexical plagiarism, and it must be chosen in the range [0,1] to carry out experiments.

Stage 8 (The Result):
Generate the plagiarism report as a web page and as a PDF file. Each plagiarized sentence is highlighted with a color that refers to the source document or web page from which the sentence is plagiarized. The report contains the percentage of plagiarism, the character count and word count of the suspicious text, and other information.
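The selection rule of Stage 7 and the ratio reported in Stage 8 can be sketched as below. Because the exact form of Eq. 8 is not reproduced here, the scoring function is passed in as a parameter; the names `flag_plagiarized` and `word_overlap`, the toy sentences, and the definition of the plagiarism ratio as flagged sentences divided by all suspicious sentences are assumptions made for illustration.

```python
def flag_plagiarized(suspicious_sentences, source_sentences, similarity, threshold):
    """Map the index of each flagged sentence to (best source index, best score)."""
    flagged = {}
    for i, s in enumerate(suspicious_sentences):
        best_j, best_score = None, 0.0
        for j, d in enumerate(source_sentences):
            score = similarity(s, d)
            if score > best_score:
                best_j, best_score = j, score
        # Keep only the largest value, and only if it exceeds the threshold
        if best_j is not None and best_score > threshold:
            flagged[i] = (best_j, best_score)
    return flagged


def word_overlap(a, b):
    """Jaccard coefficient on word sets, used here only as a stand-in scorer."""
    sa, sb = set(a.split()), set(b.split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0


suspicious = ["deep models encode sentence meaning", "our results are entirely new"]
sources = ["deep models encode word meaning", "search engines rank web pages"]
flagged = flag_plagiarized(suspicious, sources, word_overlap, threshold=0.2)
ratio = len(flagged) / len(suspicious)  # share of flagged sentences (assumption)
```

Recording the index of the best-matching source for each flagged sentence is what allows a report to color the sentence according to its source.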

Semantic Plagiarism detection using Doc2Vec - SBERT with cosine similarity
The text is preprocessed and split into n-words. Then the Doc2Vec and SBERT models are used to represent the text as vectors. After that, cosine similarity is applied to detect whether a sentence is semantically plagiarized or not.
The stages of semantic plagiarism detection are similar to those of lexical plagiarism detection, except that Stages 6 and 7 are replaced by the following:

Stage 6:
1. Use the Doc2Vec intelligent model: build the Doc2Vec model from the text files of the corpus. Each text is converted to a list of vectors using the Doc2Vec model.
2. Use the SBERT deep learning model: each of the suspicious and source sentences is encoded using the SBERT model.

Stage 7:
The suspicious sentence S is semantically plagiarized if Eq. 9 is satisfied. The value of threshold2 must be entered by the user to detect semantically plagiarized sentences, and it must be chosen in the range [0,1] to carry out experiments.

Lexical and Semantic plagiarism detection using (Jaccard - TFIDF) and (Doc2vec - SBERT) with cosine similarity
The text is preprocessed and split into n-words, then the TFIDF, Doc2Vec, and SBERT models are used to represent the text as two vectors: the first vector is used to check for lexical plagiarism and the second to detect semantic plagiarism. In this case, Stages 6 and 7 of lexical and semantic PD are merged. Each sentence S of the suspicious document is marked as lexically and semantically plagiarized (LSSim) if the following Eq. 10 is satisfied, using Equations 8 and 9:

$$LSSim(S, D) = \frac{LSim(S, D) + SSim(S, D)}{2} \ge \frac{threshold1 + threshold2}{2} \tag{10}$$

where the two terms in the numerator are the lexical and semantic similarity scores of Eqs. 8 and 9, written here as LSim(S,D) and SSim(S,D), respectively.
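Reading Eq. 10 as the mean of the lexical and semantic similarity scores compared with the mean of the two thresholds, the combined decision can be sketched as follows. The function and parameter names are hypothetical, and the scores in the example are invented for illustration.

```python
def is_lexically_and_semantically_plagiarized(lexical_sim, semantic_sim,
                                              threshold1, threshold2):
    """Compare the mean of the two scores with the mean of the two thresholds."""
    combined = (lexical_sim + semantic_sim) / 2
    return combined >= (threshold1 + threshold2) / 2


# With threshold1 = 0.2 and threshold2 = 0.3, the combined cut-off is 0.25
paraphrased = is_lexically_and_semantically_plagiarized(0.1, 0.5, 0.2, 0.3)
unrelated = is_lexically_and_semantically_plagiarized(0.1, 0.3, 0.2, 0.3)
```

Under this reading, a sentence with little word overlap can still be flagged when its semantic score is high enough to lift the mean above the combined cut-off, as in the first call.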

Results
The steps of the proposed system were applied to ten suspicious documents. Regardless of whether a document is in pdf or docx format, it is converted to text so that it can be handled by the system. If the suspicious file is a research or journal article, the system extracts its title and keywords and uses them as a query in the search engines. The search results are either web pages, whose text contents are scraped, or pdf or docx files, which are downloaded, in order to build the related corpus for that file. Table 2 gives information about the ten suspicious documents, including the type and word count of each. The results of lexical plagiarism are shown in Table 3, as percentages of plagiarism, with n-words equal to 5 and threshold1 = 0.2 as an experiment.
The result of the similarity measure between two texts may be 0, which means completely dissimilar (not plagiarized), or 1, which means identical (plagiarized). Values between 0 and 1 reflect the degree of similarity between the texts. If the value of the similarity measure is greater than or equal to the threshold, the text is considered similar (plagiarized); otherwise, it is considered dissimilar (not plagiarized). Therefore, any value between 0 and 1 could be chosen as threshold1 and threshold2.
Figure 4 depicts a snapshot of the lexical plagiarism report of one suspicious document. Each plagiarized sentence is highlighted with a color that refers to the source document or web page from which the sentence is plagiarized, while Figure 5 illustrates the total and partial plagiarism percentages for a suspicious document. It shows the source document titles along with the author and website address if the source is a webpage.
The results of semantic plagiarism are shown in Table 4, with n-words equal to 5 and threshold2 = 0.3 as an experiment, while Table 5 presents the results of semantic and lexical plagiarism with n-words equal to 5, threshold1 = 0.2, and threshold2 = 0.3 as an experiment.
Plagiarism has been detected in a sample of 10 documents with different percentages, where the highest percentage is 0.8% and the lowest percentage is 0.11%. To evaluate the system, common metrics were used: accuracy, precision, recall, and F1-score. Accuracy is typically calculated by formula 11 as follows [34]:

$$Accuracy = \frac{TP + TN}{TP + TN + FP + FN} \tag{11}$$

in which True Positives (TP) are plagiarized texts correctly classified as plagiarized, True Negatives (TN) are clean texts correctly classified as clean, False Positives (FP) are clean texts incorrectly classified as plagiarized, and False Negatives (FN) are plagiarized texts incorrectly classified as clean [34].
Table 6 shows the accuracy, precision, recall, and F1-score obtained for the suspicious documents. Lexical plagiarism detection achieves higher accuracy, since it is based on the TFIDF statistical text representation and Jaccard similarity, while semantic plagiarism detection achieves lower accuracy, since it relies on an intelligent approach to text representation.
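For reference, the four evaluation metrics can be computed from the confusion counts as sketched below, taking a true positive to be a plagiarized text that is correctly flagged as plagiarized. The function name `classification_metrics` is hypothetical, and the counts in the example are invented for illustration; they are not the results reported in Table 6.

```python
def classification_metrics(tp, tn, fp, fn):
    """Accuracy, precision, recall, and F1-score from confusion counts."""
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Invented counts, for illustration only
metrics = classification_metrics(tp=40, tn=45, fp=5, fn=10)
```

Each ratio falls back to 0.0 when its denominator is zero, which is a convention chosen here so that the function never raises on degenerate inputs.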

Discussion
The system focuses on merging the results of several search engines in order to benefit from the specific advantages of each one and to eliminate some common errors. Each engine returns different results, depending on the search methodology it uses.
It is critical to ensure that the project or design model produces reasonable outcomes and meets its objectives, which is why evaluation metrics are used to assess the project. Several constraints must be addressed in order to improve the accuracy and applicability of the proposed system. The most significant of these constraints are the accuracy of the results of the three search engines (Google, Bing, and Yahoo), the hardware of the laptop being used, and the speed of the internet connection.

Conclusions and Recommendations
This paper has presented an online plagiarism detection tool that allows users to examine a text for duplicate material on the Internet. The system takes text as input. The title or keywords of the text are extracted to create a query, which is then sent to three prominent search engines (Bing, Google, and Yahoo) through the API of each search engine in order to collect possible source texts from the web. A text similarity approach is then used to determine whether or not the provided material was plagiarized from texts found on the internet. Finally, a similarity report is created for the provided text, in which the plagiarized content is highlighted in different colors that identify the original texts.
The present outcomes are encouraging, demonstrating that integrating several search engines yields better results than using each search engine alone. The approaches used are shown to be efficient, fast, and easy; querying a sentence on a search engine instantly returns several sources linked to the query.
In the future, further experimental research is required to extend the presented system by evaluating the performance of other similarity metrics and utilizing popular off-the-shelf word embedding methods such as Stanford GloVe and Facebook fastText, as well as other deep learning-based approaches. In addition, an existing offline corpus could be used to calculate the plagiarism percentage, with or without also calculating the plagiarism percentage online.