Extractive Multi-Document Summarization Model Based On Different Integrations of Double Similarity Measures

The importance of automatic multi-document summarization stems from the rapid growth of information on the Internet. Automatic document summarization technology is progressing and may offer a solution to the problem of information overload, but producing a high-quality summary remains a challenge. In this study, the design of a generic text summarization model based on sentence extraction is redirected toward a more semantic measure that reflects individually the two significant objectives, content coverage and diversity, when generating summaries from multiple documents as an explicit optimization model. The two proposed models are then coupled and defined as a single-objective optimization problem. To improve the performance of the proposed model, different integrations of two similarity measures are introduced and applied to it, alongside the single similarity measures based on Cosine, Jaccard and Dice similarity for measuring text similarity. A Genetic Algorithm (GA) is used to solve the proposed model. Document sets supplied by the Document Understanding Conference 2002 (DUC2002) are used as the evaluation dataset, and the Recall-Oriented Understudy for Gisting Evaluation (ROUGE) toolkit is used as the evaluation metric. Experimental results illustrate the positive impact of measuring text similarity with a double integration of similarity measures rather than a single similarity measure; the best performance in terms of ROUGE-1 and ROUGE-2 was recorded for the integration of Cosine similarity and Jaccard similarity.


Introduction
One of the most important challenges today is the rapid increase in the amount of data generated by users, especially on the Internet. Textual data is among the fastest-growing types, which makes it very difficult for people to take advantage of such data in its natural state and makes automatic summarization increasingly important. Although research on automatic document summarization began in the second half of the 20th century, no fully satisfactory solution exists so far, and progress has been relatively modest.
Text summarization condenses a large amount of information into a concise form by selecting important information and discarding unimportant and redundant information. Given the amount of textual information on the World Wide Web, automatic text summarization is becoming very important in the field of Information Retrieval. Search engines do a remarkable job of searching through a mass of information to return what is most related to the user's query, yet even the information they select with great precision is of a daunting amount. Reading whole documents is very time consuming, and many tasks demand a decision within a definite time frame, so reading all the documents is impractical. Having the core of each document available speeds up the process considerably, and this is where automatic text summarization becomes critical. Document summarization methodologies can generally be divided into extractive and abstractive. Abstractive summarization produces a summary that involves concepts or ideas taken from the source, which are then "reinterpreted" and presented in a different form. Extractive summarization constructs a summary that consists of units of text taken from the source and presented verbatim [1].
Depending on the number of documents being summarized, the summary can be a condensed form of one document or of multiple documents. Multi-document summarization aims at extracting information relevant to an implicit or explicit subject from different documents written about that subject or topic [2].
Extraction-based summarization approaches can be categorized as supervised or unsupervised. Supervised approaches are built on algorithms that use a large number of human-generated summaries and, as a result, are most suitable for documents related to the summarizer model. Accordingly, they do not necessarily yield an adequate summary for documents that are dissimilar to the model. Furthermore, when users modify the summarization purpose or the documents' features, it becomes necessary to retrain the model or rebuild the training data. Unsupervised approaches do not require training data for the summarizer. An automatic summary can contain either the most significant information overall (generic summarization) or the most relevant information with respect to a user's information need (query-based summarization). Generic summarization approaches focus on the diversity of the summary in order to deliver broader content coverage. Usually, they are described in terms of certain key features that relate to the concepts of intent, focus and coverage.
Considering usage, a summary can be indicative or informative. An indicative summary provides condensed information on the key topics of a document. It should preserve the document's most important passages and is often used as the end part of information retrieval systems, being returned by the search system instead of the full document. Its purpose is to help the user decide whether the original document is worth reading. The typical length of an indicative summary ranges from 5% to 10% of the whole text. In contrast, an informative summary delivers a condensed version of the complete document, retaining significant information while reducing its volume. An informative summary is normally 20-30% of the original text [3].
The main contribution of this paper is to model the multi-document text summarization task as an optimization problem. The proposed model emphasizes the discovery of essential sentences that cover the main topic of the document collection while avoiding redundant sentences. Different integrations of two similarity measures are introduced into the proposed model for measuring similarity, with the aim of improving system performance. A binary-encoded genetic algorithm is adopted to solve the modeled optimization problem. The paper is organized as follows. Section 2 presents related work on extractive summarization. Section 3 introduces elementary concepts for extractive multi-document text summarization together with the statement of the problem. Section 4 gives the details of the proposed mathematical formulation and modeling. Section 5 introduces the proposed genetic algorithm for solving the optimization problem. Section 6 presents the experiments performed and their results. Finally, Section 7 gives conclusions and some possible extensions to the current work.

Related works
In the literature, multi-document summarization approaches vary in their essence. Various extraction-based techniques have been proposed for generic text summarization [4]. In extraction-based document summarization, generating the optimal summary can be regarded as a combinatorial optimization problem for which finding a solution is NP-hard. The optimization-based works most closely related to the method proposed in this paper are reviewed in what follows.
Alguliev et al. (2011) presented a document summarization model that extracts significant sentences from a given collection of documents while reducing information redundancy in the summary. A novel aspect of their model is its capability to eliminate redundant information while choosing representative sentences. The model was formulated as a discrete optimization problem, which they solved with an adaptive Differential Evolution algorithm. They applied the model to multi-document summarization, and their experimental results showed that the proposed optimization approach was competitive on the DUC2002 and DUC2004 datasets [5].
Alguliev et al. (2011) proposed an unsupervised model for text summarization that generates a summary by extracting the significant sentences from the given document(s). They modeled text summarization as an integer linear programming problem. Their model is able to cover the core content of the collection by discovering its important sentences, and it also guarantees that the summary does not contain several sentences conveying similar information [6]. Alguliev et al. (2013) modeled document summarization as linear and nonlinear optimization problems. These models attempt to balance diversity and coverage in the summary. The optimization problem was solved by developing a new particle swarm optimization (PSO) algorithm. Their experiments revealed that the proposed models produced very competitive results, which considerably outperformed the NIST baselines [7].
In Alguliev et al. (2013), an optimization-based model for generic text summarization was proposed. The model generates a summary by extracting significant sentences from a given collection of documents while reducing summary redundancy, taking into account the sentence-to-sentence, summary-to-collection and sentence-to-collection relations. An improved differential evolution algorithm was created for solving the optimization problem; the algorithm adaptively adjusts the crossover rate according to individual fitness [1].
Alguliev et al. (2015) presented an unsupervised optimization-based method for automatically summarizing text. They modeled text summarization as a Boolean programming problem. Their model attempts to optimize three properties, namely relevance, redundancy reduction and bounded summary length. The proposed method is applicable to both single-document and multi-document summarization [8].
Abdi et al. (2015) proposed a specialized method that works well in assessing short summaries. Their method integrates the semantic relations between words and their syntactic composition. As a result, it obtained high accuracy and improved performance compared with existing techniques in their experiments [2].
In Rautray and Balabantaray (2017), a novel Cat Swarm Optimization (CSO)-based multi-document summarizer was proposed to address the problem of multi-document summarization. The proposed CSO-based model was also compared with two other nature-inspired summarizers, namely the Harmony Search (HS)-based summarizer and the Particle Swarm Optimization (PSO)-based summarizer [9].
Alguliev et al. (2019) modeled text summarization as a two-stage sentence selection model built on optimization and clustering methods. First, to discover all topics in the text, they clustered the set of sentences using the k-means method. Second, to select significant sentences from the clusters, they proposed an optimization-based model. Its objective function is expressed as a harmonic mean of the objectives enforcing the coverage and diversity of the sentences selected for the summary. To preserve summary readability, their model also controls the length of the chosen sentences. The optimization problem was solved by developing an adaptive differential evolution algorithm with a new mutation approach [10].

Extractive generic multi-document text summarization

Preliminaries
Several methodologies have been explored for text similarity; they fall into four major categories: word co-occurrence/vector-based methods, corpus-based methods, hybrid methods, and descriptive feature-based methods [11].
In text summarization, vector-based methods are commonly used [12]. Let $T = \{t_1, t_2, \ldots, t_m\}$ represent the distinct terms in a document collection $D$. Cosine similarity is the most popular measure for evaluating text similarity between any pair of sentences represented as vectors of terms. Each sentence $s_i$ is represented as a vector $s_i = (w_{i1}, w_{i2}, \ldots, w_{im})$, where the weight $w_{ik}$ is associated with term $t_k$ according to its magnitude in sentence $s_i$. The weights used by the cosine similarity metric can be formulated, according to the term-frequency inverse-sentence-frequency scheme ($tf\text{-}isf$), as follows [12]:

$$w_{ik} = tf_{ik} \times \log\left(\frac{n}{n_k}\right) \qquad (1)$$

where $tf_{ik}$ is the measure of how frequently the term $t_k$ occurs in the sentence $s_i$, and $\log(n / n_k)$ is the measure of how few sentences contain the term $t_k$, with $n$ the number of sentences in the collection and $n_k$ the number of sentences containing $t_k$. Intuitively, if the term $t_k$ does not exist in sentence $s_i$, $w_{ik}$ should be zero.

Measuring the similarity between words, sentences, paragraphs and documents is an important component of text-related research and applications, including text classification, text summarization, information retrieval (IR), document clustering and others. Calculating similarity between words is an essential part of measuring similarity between texts, and it serves as a primary stage for calculating similarities between sentences, paragraphs and documents [11].
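As an illustration of the tf-isf weighting and the cosine measure described above, the following is a minimal Python sketch, not the authors' C# implementation. It assumes raw term counts for the term frequency, the natural logarithm, and pre-tokenized sentences; the function names `tf_isf_vectors` and `cosine` and the toy sentences are hypothetical.

```python
import math
from collections import Counter


def tf_isf_vectors(sentences):
    """Return one {term: weight} dict per tokenized sentence.

    Assumed weighting: raw term count times log(n / n_k), where n is the
    number of sentences and n_k the number of sentences containing the term.
    """
    n = len(sentences)
    # n_k: in how many sentences each term appears (counted once per sentence)
    sentence_freq = Counter(term for sent in sentences for term in set(sent))
    vectors = []
    for sent in sentences:
        tf = Counter(sent)
        vectors.append({t: tf[t] * math.log(n / sentence_freq[t]) for t in tf})
    return vectors


def cosine(u, v):
    """Cosine of the angle between two sparse weight vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # an all-zero vector is treated as similar to nothing
    return dot / (norm_u * norm_v)


# Hypothetical toy collection of three tokenized sentences
sentences = [
    ["genetic", "algorithm", "selects", "sentences"],
    ["genetic", "algorithm", "optimizes", "summary"],
    ["rouge", "measures", "summary", "quality"],
]
vecs = tf_isf_vectors(sentences)
print(cosine(vecs[0], vecs[1]))  # shares two terms, so strictly between 0 and 1
print(cosine(vecs[0], vecs[2]))  # no shared terms, so the similarity is zero
```

Note that under this weighting a term occurring in every sentence receives weight zero, which matches the intuition that such a term does not help to distinguish sentences.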
Similarity between words can be lexical or semantic. Words are lexically similar if they have a similar character sequence, whereas they are semantically similar if they have the same meaning and are used in the same context [13]. For the model proposed in this paper, similarity between two texts is measured using Cosine, Jaccard and Dice similarity. Cosine similarity computes the similarity between two vectors by calculating the cosine of the angle between them. Hence, whereas the inner product depends on the magnitudes of the two vectors, the cosine depends only on the angle between them. Cosine similarity is a good technique for ranking documents by discovering the document closest to the user query [14].
Jaccard similarity is a statistical similarity measure between sample sets. It compares the members of two sets to discover which members are shared and which are distinct. Although it is easy to interpret, it is very sensitive to small sample sizes and may give incorrect results, particularly with very small data sets with missing observations [15].
Dice similarity is similar to Jaccard and is used for finding the similarity between two vectors, but it "gives twice the weight to agreements" [16,17,18].
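The difference between the two measures can be seen in a small Python sketch. It assumes the common set-based forms over the distinct terms of each sentence, which may differ from the exact vector formulas used in this paper; the function names are hypothetical.

```python
def jaccard(a, b):
    """Shared members divided by all distinct members of the two sets."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def dice(a, b):
    """Like Jaccard, but the shared members are counted twice."""
    a, b = set(a), set(b)
    total = len(a) + len(b)
    return 2 * len(a & b) / total if total else 0.0


# Hypothetical tokenized sentences: 2 shared terms, 6 distinct terms overall
s1 = ["genetic", "algorithm", "selects", "sentences"]
s2 = ["genetic", "algorithm", "optimizes", "summary"]
print(jaccard(s1, s2))  # 2 / 6
print(dice(s1, s2))     # (2 * 2) / (4 + 4) = 0.5
```

For the same pair of sets the two values are related by Dice = 2J / (1 + J), so Dice is never smaller than Jaccard.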

Problem statement
Consider a collection of documents $D$ comprising $N$ documents, i.e. $D = \{d_1, d_2, \ldots, d_N\}$. Also, consider that $D$ is composed of $n$ sentences in total. In terms of sentences, $D$ can then be denoted by $D = \{s_1, s_2, \ldots, s_n\}$, where $n$ refers to the number of different sentences contained in all documents in $D$. The objective of the proposed work is to generate a summary $\bar{S} \subset D$ while tackling three challenges:
• Covering contents: the generated summary $\bar{S}$ should cover the main topic of the collection $D$.
• Reducing redundancy: the generated summary $\bar{S}$ should not include similar sentences contained in $D$.
• Bounded length: the length of the summary $\bar{S}$ should be restricted.

The proposed model: definitions and formulations
In this paper, the text summarization problem is addressed as a single-objective optimization (SOO) problem. The intended summary $\bar{S}$ is characterized, in light of the defined problem, by the following definitions of the proposed SOO-based model.

Definition 1 (Summary $\bar{S}$). Let $s_i$ and $s_j$ be a pair of sentences to be included in $\bar{S}$. Then the content coverage, stated by the summation of similarity for each such pair, namely $sim(s_i, O)$ between $s_i$ and the set of sentences in the document collection (represented by its mean vector $O$) and $sim(s_j, O)$ between $s_j$ and the same set of sentences, should be maximized. Conversely, to reduce redundancy, the similarity $sim(s_i, s_j)$ between the same pair of sentences belonging to $\bar{S}$ should be minimized.

Definition 2 (Text summarization problem). The two requirements of Definition 1 are summed over all pairs of sentences selected for the candidate summary and coupled into a single objective to be optimized. The SOO-based model thus aims to include in the candidate summary the pairs of sentences that have high similarity to the main contents of the document collection, in order to satisfy content coverage, and that simultaneously have low similarity to each other, in order to introduce diverse ideas into the candidate summary.
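The idea behind the two definitions can be illustrated with a short Python sketch. The way coverage and redundancy are coupled here (coverage minus redundancy, summed over every selected pair) is an assumption made for illustration only and is not necessarily the paper's objective function; the function name `objective` and the similarity values are hypothetical.

```python
from itertools import combinations


def objective(x, sim_to_center, sim_pair):
    """Score a binary selection vector x.

    ASSUMED coupling: for each selected pair (i, j), reward the similarity of
    both sentences to the collection center and penalize their similarity to
    each other.
    """
    selected = [i for i, bit in enumerate(x) if bit]
    score = 0.0
    for i, j in combinations(selected, 2):
        coverage = sim_to_center[i] + sim_to_center[j]
        redundancy = sim_pair[i][j]
        score += coverage - redundancy
    return score


# Hypothetical similarities for three sentences
sim_to_center = [0.8, 0.7, 0.3]
sim_pair = [
    [1.0, 0.9, 0.1],
    [0.9, 1.0, 0.2],
    [0.1, 0.2, 1.0],
]
near_duplicates = objective([1, 1, 0], sim_to_center, sim_pair)  # 0.8 + 0.7 - 0.9
diverse_pair = objective([1, 0, 1], sim_to_center, sim_pair)     # 0.8 + 0.3 - 0.1
print(near_duplicates, diverse_pair)
```

In this toy case the two near-duplicate sentences score lower than the diverse pair even though each of them is individually closer to the center, which is the trade-off the model is meant to capture.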

The proposed similarity integrations
Different integrations of similarity measures are introduced and applied to the proposed model for measuring similarity, including:
• Single similarity measures: These metrics measure the similarity between a pair of sentences, and between a sentence and the center of the document collection, by applying individually Equations (2), (3) and (4) for the Cosine, Jaccard and Dice similarity measures, respectively.
• Double similarity measures integration: These metrics measure the similarity between a pair of sentences, and between a sentence and the center of the document collection, by applying formulas that are weighted sums of the two similarity measures under consideration: (Cosine and Jaccard), (Cosine and Dice) and (Jaccard and Dice).
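A weighted sum of two similarity values can be sketched in Python as follows. The convex form with a single weight `sigma`, and the choice of which measure receives `sigma`, are assumptions made for illustration; the function name and the similarity values are hypothetical.

```python
def combined_similarity(sim_a, sim_b, sigma):
    """ASSUMED double integration: sigma * sim_a + (1 - sigma) * sim_b."""
    if not 0.0 <= sigma <= 1.0:
        raise ValueError("sigma must lie in [0, 1]")
    return sigma * sim_a + (1.0 - sigma) * sim_b


# Candidate weights from 0.1 to 0.9 in steps of 0.1
sigmas = [round(0.1 * k, 1) for k in range(1, 10)]

# Hypothetical Cosine and Jaccard values for one pair of sentences
cosine_value, jaccard_value = 0.6, 0.2
for sigma in sigmas:
    print(sigma, combined_similarity(cosine_value, jaccard_value, sigma))
```

With `sigma` close to 1 the combined value follows the first measure, and with `sigma` close to 0 it follows the second, so sweeping the weight shows how much each measure contributes.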

The proposed genetic algorithm
Each genotype solution in the proposed GA is encoded using binary encoding and characterized by a fixed-length vector of size $n$, $X = (x_1, x_2, \ldots, x_n)$, wherein each gene value $x_i \in \{0, 1\}$ indicates the existence or nonexistence of its related sentence $s_i$. The entire search space of the proposed GA is then the Cartesian product of the existence/nonexistence of all sentences:

$$\Omega = \prod_{i=1}^{n} \{0, 1\} \qquad (10)$$

Consider a population of genotype solutions, $P$. The proposed GA can then be described as an iterative process $P(t+1) = \Psi(P(t))$ with $t = 0, 1, 2, \ldots$, where $P(t)$ is the population at iteration $t$. The evolution function $\Psi$ at every iteration is composed of three key operators, namely the selection operator $\mathcal{S}$, the crossover operator $\mathcal{C}$ and the mutation operator $\mathcal{M}$, each governed by its corresponding control parameter $\theta_s$, $\theta_c$ and $\theta_m$. Formally, this is noted as:

$$P(t+1) = \mathcal{M}_{\theta_m}\left(\mathcal{C}_{\theta_c}\left(\mathcal{S}_{\theta_s}\left(P(t)\right)\right)\right) \qquad (11)$$

The selection operator $\mathcal{S}_{\theta_s}$ copies the fittest, good-quality chromosomes to the next generation in order to improve the average quality of the population, while bad chromosomes are eliminated. The proposed work adopts tournament selection, in which only one individual, the fittest among several randomly chosen individuals, is selected for the next generation. The control parameter $\theta_s$ determines the number of randomly chosen individuals, i.e. the tournament size.
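The binary encoding and the tournament selection described above can be sketched in Python as follows. This is an illustrative sketch, not the authors' C# code; the function names are hypothetical, and the fitness used here (the number of selected sentences) is only a placeholder for the model's objective function.

```python
import random


def random_chromosome(n, rng):
    """Binary genotype: gene i is 1 if sentence i is in the summary, else 0."""
    return [rng.randint(0, 1) for _ in range(n)]


def tournament_select(population, fitness, tournament_size, rng):
    """Pick tournament_size distinct individuals at random, return the fittest."""
    contestants = rng.sample(range(len(population)), tournament_size)
    winner = max(contestants, key=lambda idx: fitness[idx])
    return population[winner]


rng = random.Random(0)  # fixed seed so the sketch is repeatable
population = [random_chromosome(6, rng) for _ in range(50)]
fitness = [sum(ch) for ch in population]  # placeholder fitness, NOT the paper's objective
parent = tournament_select(population, fitness, 2, rng)
print(parent)
```

A small tournament size such as 2 keeps selection pressure low, since a weak individual is only eliminated when it happens to be drawn against a stronger one.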
The proposed algorithm adopts uniform crossover. In this type of crossover, each gene of the child chromosome is created by randomly selecting the corresponding gene from one of its parents. Both parents have an equal chance of contributing to the chromosomes produced from them. The crossover control parameter determines the crossover rate.
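Uniform crossover as described above can be sketched in Python as follows. The sketch assumes that two complementary children are produced per pair of parents and that the parents are copied unchanged when crossover is not applied; both choices, and the function name, are assumptions made for illustration.

```python
import random


def uniform_crossover(parent_a, parent_b, crossover_rate, rng):
    """Build two children gene by gene, each gene taken from either parent."""
    if rng.random() >= crossover_rate:
        # Crossover not applied: the children are copies of the parents
        return parent_a[:], parent_b[:]
    child_a, child_b = [], []
    for gene_a, gene_b in zip(parent_a, parent_b):
        if rng.random() < 0.5:  # both parents are equally likely to contribute
            child_a.append(gene_a)
            child_b.append(gene_b)
        else:
            child_a.append(gene_b)
            child_b.append(gene_a)
    return child_a, child_b


rng = random.Random(1)  # fixed seed so the sketch is repeatable
mother = [0, 0, 0, 0, 0, 0, 0, 0]
father = [1, 1, 1, 1, 1, 1, 1, 1]
child_1, child_2 = uniform_crossover(mother, father, 0.7, rng)
print(child_1, child_2)
```

Because every gene is decided independently, uniform crossover does not favor sentences that happen to be adjacent in the chromosome, unlike one-point crossover.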
The best solution $X^{*}$ (in terms of maximum objective value $f$) of the final generation of the GA is selected as the result of the maximization problem, which is formally specified as:

$$X^{*} = \arg\max_{X \in P(t_{max})} f(X) \qquad (12)$$

where $P(t_{max})$ denotes the population of the final generation. However, the phenotype of the best solution may still violate the length constraint:

$$\sum_{i=1}^{n} l_i x_i \le L \qquad (13)$$

where $l_i$ is the length of sentence $s_i$, $x_i$ is its gene value in $X^{*}$ and $L$ is the allowed summary length.

Experimental results

Requirements and parameter setting
The proposed system was coded in C# using Microsoft Visual Studio Ultimate 2013. The experiments were executed on a THINK-PC Lenovo z5170 with an Intel Core i7-5500 CPU at 2.4 GHz, 8 GB of RAM, a 1 TB HDD and an AMD Radeon 4 GB video card. The GA parameters were set as follows: a population of 50 individuals is evolved over a sequence of 100 generations. For the tournament selection, a tournament size of 2 was chosen. The crossover probability and the mutation probability are set to 0.7 and 0.1, respectively. An overlapping parameter is also used when applying the Dice and Jaccard similarity measures. The proposed model was evaluated quantitatively on the multi-document summarization datasets provided by the Document Understanding Conference (DUC), particularly the DUC2002 dataset. Brief statistics of the dataset are given in

Evaluation metrics
The proposed work is evaluated quantitatively using the Recall-Oriented Understudy for Gisting Evaluation (ROUGE) metric. ROUGE is considered the official evaluation metric for text summarization by DUC. It includes measures that automatically determine the quality of a computer-generated summary by comparing it with human-generated summaries. The comparison is made by counting the number of overlapping units, such as N-grams, word sequences and word pairs, between the machine-generated summary and a set of reference summaries generated by humans.
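The overlap counting described above can be sketched for N-grams as a recall computation in Python. This is a simplified illustration and not the official ROUGE toolkit, which additionally offers options such as stemming and stop-word removal; the function names and the example summaries are hypothetical.

```python
from collections import Counter


def ngrams(tokens, n):
    """Count every run of n consecutive tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n(candidate, references, n):
    """N-gram recall of a candidate summary against reference summaries."""
    cand_counts = ngrams(candidate, n)
    matched = 0
    total = 0
    for reference in references:
        ref_counts = ngrams(reference, n)
        total += sum(ref_counts.values())
        # A reference N-gram is matched at most as often as the candidate has it
        matched += sum(min(count, cand_counts[gram])
                       for gram, count in ref_counts.items())
    return matched / total if total else 0.0


candidate = "the model selects diverse sentences".split()
reference = "the model selects important sentences".split()
print(rouge_n(candidate, [reference], 1))  # 4 of 5 reference unigrams matched: 0.8
print(rouge_n(candidate, [reference], 2))  # 2 of 4 reference bigrams matched: 0.5
```

Because the denominator counts N-grams in the references only, the measure rewards a candidate for recovering reference content and does not penalize extra material, which is why summary length is bounded separately.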
ROUGE-N is an N-gram recall that counts the number of N-gram matches between two summaries, and it is calculated as follows:

$$ROUGE\text{-}N = \frac{\sum_{S \in Summ_{ref}} \sum_{N\text{-}gram \in S} Count_{match}(N\text{-}gram)}{\sum_{S \in Summ_{ref}} \sum_{N\text{-}gram \in S} Count(N\text{-}gram)}$$

where $N$ stands for the length of the N-gram, $Count_{match}(N\text{-}gram)$ is the maximum number of N-grams co-occurring in the candidate summary and the set of reference summaries $Summ_{ref}$, and $Count(N\text{-}gram)$ is the number of N-grams in the reference summaries. In this paper, ROUGE-1 and ROUGE-2 are used for evaluating the performance of the proposed system and for comparing it with other state-of-the-art methods.

Table 2 and its related figures record the average ROUGE scores of the proposed model when the similarity is calculated using single similarity measures (Cosine, Jaccard and Dice similarity). The performance was evaluated on the DUC2002 dataset and is reported as an average of 20 different runs with the same parameters. From Table 2 and its related figures, it is clear that the proposed system performs better in terms of ROUGE-2 when Cosine similarity is used for measuring text similarity, whereas better performance in terms of ROUGE-1 was recorded using Jaccard similarity and also Dice similarity. These results encouraged us to introduce different integrations of these similarity measures and apply them to the proposed model in order to improve its performance.

Tables 3, 4 and 5 and their related figures record the average ROUGE scores of the proposed model when the similarity is calculated using double similarity measures generated from different combinations of Cosine, Jaccard and Dice similarity. The performance was again evaluated on the DUC2002 dataset and is reported as an average of 20 different runs with the same parameters, with the value of $\sigma$ varied from 0.1 through 0.9 in steps of 0.1. The summarized results shown in Table 6 are the highest scores recorded from applying the three integrations to the proposed model in terms of ROUGE-1 and ROUGE-2.

Figure 4b. Average ROUGE scores obtained by applying the proposed model with the integration of Jaccard and Dice similarity measures on the DUC2002 dataset.

[...] the proposed system, the system has recorded the best performance, whereas for the integration [...], the best performance of the proposed system was recorded when $\sigma$ is set to 0.1.
The detailed results recorded in Tables 3 through 5 for the different integrations of double similarity measures show the positive impact of measuring similarity between texts through the integration of more than one similarity measure rather than a single similarity measure: the proposed model recorded higher performance with the integration of two similarity measures than with a single similarity measure at all ROUGE scores.

Conclusions
Automatic text summarization systems face the challenge of producing high-quality summaries. In this paper, the design of a generic text summarization model based on sentence extraction was redirected toward a more semantic measure that reflects individually the two significant objectives, content coverage and diversity, when generating summaries from multiple documents as an explicit optimization model. The two proposed models were then coupled and defined as a single-objective optimization problem. In addition to the single similarity measures for measuring text similarity, different integrations of two similarity measures were introduced and applied to the proposed model.
Applying different integrations of similarity measures in the proposed SOO-based model has shown a positive impact. When a single similarity measure (Cosine, Jaccard or Dice similarity) was applied to the proposed SOO model to measure text similarity, the system performed well in either ROUGE-1 or ROUGE-2, whereas applying an integration of two similarity measures improved the performance in terms of both ROUGE-1 and ROUGE-2.
The proposed work may be extended or improved in a number of directions. Improving the tasks of the preprocessing phase is expected to improve the overall text summarization system and the quality of the summaries it produces. The focus may be on adding further rules to the stemmer to improve the quality of stems, or on handling punctuation marks through more effective schemes. Another direction for future work is applying the proposed system to the summarization of Arabic texts by adapting the preprocessing phase to the rules dedicated to the segmentation, tokenization and stemming of Arabic text. Moreover, additional objectives can be taken into consideration by the proposed model. For instance, coherence and cohesion could be optimized simultaneously with the content coverage and redundancy reduction objectives.