Heuristic Modularity for Complex Identification in Protein-Protein Interaction Networks

Because of its significant role in understanding cellular processes, the decomposition of Protein-Protein Interaction (PPI) networks into essential building blocks, or complexes, has received much attention in functional bioinformatics research in recent years. One of the well-known bi-clustering descriptors for identifying communities and complexes in complex networks, such as PPI networks, is the modularity function. The contribution of this paper is to introduce heuristic optimization models that can work together with the modularity function to improve its detection ability. The formulated heuristics are defined in terms of nodes and different levels of their neighbor properties. The modularity function and the formulated heuristics are then embedded in a single-objective Evolutionary Algorithm (EA) tailored specifically to the problem, and thus used to identify possible complexes in PPI networks. In the experiments, different overlapping scores are used to evaluate the detection accuracy at both the complex and protein levels. According to the evaluation metrics, the results reveal that the introduced heuristics improve the accuracy of the existing modularity function when identifying protein complexes in the tested PPI networks.


Introduction
Many complex systems in all areas of science, including social science, politics, biology and medicine, can be represented as networks. Topological analyses of such complex networks are widely applicable and provide insights in many scientific studies. Complex systems are usually organized in compartments, each of which has its own role and/or function. In the network representation, such compartments appear as sets of nodes with a high density of internal links, whereas links between compartments have a lower density. These subgraphs are called communities, or modules, and can occur in a wide variety of networked systems. Finding compartments may shed light on the organization of complex systems and on their function. Therefore, detecting communities in networks has become a fundamental problem in network science. Many methods have been developed, using tools and techniques from different disciplines such as physics, applied mathematics, biology, computer science and the social sciences. However, it is still not clear which algorithms are reliable and should be used in applications [1].
As an increasing amount of protein-protein interaction (PPI) data becomes available, its computational interpretation has become an important problem in bioinformatics. Observations show that PPI networks possess invaluable evolutionary insights and information to understand various biological processes and cellular functions. However, prediction of protein complexes, like many other practical optimization problems, falls into the category of strongly NP-hard combinatorial optimization problems that can easily bewilder exact optimization algorithms [2], [3].
Complex network clustering resembles data clustering in that both divide the entities of interest into clusters or modules. However, clusters in complex networks are defined by both inter-connection and intra-connection densities, while clusters in data clustering are groups of points close to each other, which together form several local optima. In the literature, modularity [3] and a series of follow-up models have been proposed to measure the quality of a set of predicted intra-dense and inter-sparse subgraphs in a graph. The majority of these works have been applied to community detection in social networks and to complex identification in PPI networks.
The main contribution of this paper is to develop heuristic definitions for modularity that can improve its detection ability. We are motivated by the neighboring properties of proteins and by how such properties can be exploited and coupled with the modularity function. In this paper, a single-objective Evolutionary Algorithm (EA) [4] is adopted to combine modularity (as an objective function) and the proposed heuristic approaches. The algorithm attempts, with the aid of modularity, to identify the global structure of the complexes and, with the aid of the heuristic functions, to fine-tune such complexes. In the experimental results, we show that coupling the proposed heuristic operator, as an exploiter that captures the local structures of the solutions provided by modularity, can significantly improve the detection performance of the EA. In the remainder of this paper, preliminary concepts relating to the complex detection problem in PPI networks and the main approaches in the literature to solving it are presented first. These are followed by a closer look at the formal development of the proposed evolutionary complex detection framework together with the proposed heuristic operators. Experimental results are then provided to support the positive impact of the proposed heuristic definitions in further correcting the complex structures found by the well-known modularity function. The final section of this paper presents the major conclusions and further directions of this work.

Related background
A complex network such as a PPI network can be denoted by $N$, consisting of $n$ proteins and $m$ interactions. In other terms, $N$ is said to have cardinality $n$ and volume $m$. $N$ can be modeled as an undirected graph $G = (V, E)$ with a set $V = \{v_1, v_2, \dots, v_n\}$ of vertices and a set $E = \{(v_i, v_j) \mid v_i, v_j \in V\}$ of edges. Note that throughout this paper, the terms tie, edge, link, connection, relation, and interaction are used interchangeably to denote any vertex pair $(v_i, v_j)$ in $E$. Also, let $\Omega$ be the space of all possible partitioning solutions for $N$ and let $C = \{C_1, C_2, \dots, C_K\}$ be a network partitioning solution belonging to $\Omega$ with $K$ partitions or divisions. Normally, any unsigned graph can be represented by a symmetric adjacency matrix, denoted by $A$. Rows and columns of $A$ are labeled with the vertices of $G$; entry $A(i, j)$ is set to $1$ if the vertex pair $(v_i, v_j)$ is in $E$, and to $0$ if $(v_i, v_j) \notin E$. From the adjacency matrix, a set of direct neighboring lists, $\mathcal{L} = \{L_1, L_2, \dots, L_n\}$, can be formed. Each list $L_i$ in the set aggregates the connections of all vertices with vertex $v_i$. Thus, $|L_i| = \sum_{j=1}^{n} A(i, j)$ and $|\mathcal{L}| = \sum_{i=1}^{n} |L_i|$. Mathematically, $|L_i|$ is said to be the degree of $v_i$, while $|\mathcal{L}|$ is said to be the volume of $\mathcal{L}$ (for an undirected graph, $|\mathcal{L}| = 2m$, since every edge is counted from both of its endpoints). Furthermore, the strength of each node can be specified in more detail by $|L_i| = |L_i^{in}| + |L_i^{out}|$, where $|L_i^{in}|$ and $|L_i^{out}|$ are the intra-strength and inter-strength of node $v_i$, respectively. Generalizing this to all nodes implies $|\mathcal{L}| = \sum_{i=1}^{n} |L_i^{in}| + \sum_{i=1}^{n} |L_i^{out}|$.
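The neighbor lists, degrees, and the intra/inter split of a node's strength described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the toy graph, the cluster labelling, and the helper names `neighbor_lists` and `strengths` are made up for the example and do not come from the paper.

```python
from collections import defaultdict

def neighbor_lists(edges):
    """Build the direct-neighbor set of every vertex from an undirected edge list."""
    nbrs = defaultdict(set)
    for u, v in edges:
        nbrs[u].add(v)
        nbrs[v].add(u)
    return dict(nbrs)

def strengths(nbrs, label, v):
    """Split the degree of v into (intra, inter) counts for a given cluster labelling."""
    intra = sum(1 for u in nbrs[v] if label[u] == label[v])
    return intra, len(nbrs[v]) - intra

# Hypothetical toy graph: two triangles joined by the single edge (2, 3).
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
nbrs = neighbor_lists(edges)
label = {0: 0, 1: 0, 2: 0, 3: 1, 4: 1, 5: 1}

degree = {v: len(ns) for v, ns in nbrs.items()}  # degree of each vertex
volume = sum(degree.values())                    # sum of all degrees (twice the edge count)
```

In this toy graph only vertices 2 and 3 have a non-zero inter-strength, because they carry the single edge that crosses the two clusters.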

Modularity based co-clustering model
Co-clustering, or simultaneous matrix partitioning (in contrast to clustering, as depicted in Figure 1), needs a quality function that can capture the embedded distinct sub-matrix structures. The modularity model (normally denoted $Q$), defined by Newman and Girvan, lays the foundation of many existing successful graph clustering algorithms [5], [6]. The purpose of $Q$ is to capture the hidden structure of sub-graphs or community sets in complex networks by maximizing intra-cluster links while minimizing inter-cluster ones.
Consider partitioning $N$ into a co-clustering solution $C = \{C_1, C_2, \dots, C_K\}$ such that each vertex $v_i$ is assigned to exactly one cluster $C_k$. The impact of $C_k$ in $C$ is quantified in two distinct terms: the set of edges between vertices lying in two distinct clusters, $E_{out}(C_k) = \{(v_i, v_j) \in E \mid v_i \in C_k, v_j \notin C_k\}$, and the set of edges found inside one cluster, $E_{in}(C_k) = \{(v_i, v_j) \in E \mid v_i, v_j \in C_k\}$. Then, modularity rewards $C$ according to the fraction of connections inside its communities, as formulated in Eq. 1. The left term in Eq. 1 biases towards a solution that is covered with a set of densely intra-connected modules, i.e. many edges fall within each sub-graph $C_k$. On the other hand, the right term in Eq. 1 expresses that the expected value of the same edge density in a network with the same community structure $\{C_1, C_2, \dots, C_K\}$, but with edges falling at random between the vertices, should be small. $Q$ approaches $0$ if the number of within-community edges is no better than random. On the other hand, values approaching $1$, which is the maximum, indicate strong community structure.

$$Q(C) = \sum_{k=1}^{K} \left[ \frac{|E_{in}(C_k)|}{m} - \left( \frac{2\,|E_{in}(C_k)| + |E_{out}(C_k)|}{2m} \right)^{2} \right] \qquad (1)$$

Figure 1-Clustering against co-clustering. Left: clustering means partitioning all data vectors with all their features into (sometimes unknown) disjoint groups. Right: co-clustering, or bi-clustering, means partitioning into a set of (sometimes unknown) blocks, each containing a consistent local feature pattern (note that it is not generally possible to display several bi-clusters at the same time as contiguous blocks).
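As a rough illustration of the modularity computation, the following Python sketch evaluates the standard Newman-Girvan form of $Q$ (fraction of intra-cluster edges minus the squared fraction of edge endpoints attached to each cluster). The function name `modularity`, the edge-list input, and the toy graph are assumptions made for the example, not the authors' implementation.

```python
def modularity(edges, label):
    """Newman-Girvan modularity of a labelling of an undirected, unweighted graph."""
    m = len(edges)
    intra = {}    # cluster -> number of edges with both endpoints inside it
    deg_sum = {}  # cluster -> sum of the degrees of its vertices
    for u, v in edges:
        deg_sum[label[u]] = deg_sum.get(label[u], 0) + 1
        deg_sum[label[v]] = deg_sum.get(label[v], 0) + 1
        if label[u] == label[v]:
            intra[label[u]] = intra.get(label[u], 0) + 1
    return sum(intra.get(k, 0) / m - (d / (2 * m)) ** 2
               for k, d in deg_sum.items())

# Hypothetical toy graph: two triangles joined by the single edge (2, 3).
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
two_clusters = {0: 0, 1: 0, 2: 0, 3: 1, 4: 1, 5: 1}
one_cluster = {v: 0 for v in range(6)}

q_two = modularity(edges, two_clusters)  # 2 * (3/7 - (7/14)**2) = 5/14
q_one = modularity(edges, one_cluster)   # 7/7 - (14/14)**2 = 0
```

Placing every vertex in a single cluster gives $Q = 0$, which matches the "no better than random" case, while splitting the graph into its two triangles gives a clearly positive value.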

The proposed heuristic based modularity
This section introduces a heuristic-based approach for modularity with three different optimization models. A set of protein-neighborhood related functions is proposed to extend the detection ability of a single-objective EA. First, the main components that characterize the evolution process of the EA (solution representation and perturbation operators) are formulated for the problem. Then, the optimization models and the heuristic operator are introduced and formulated to improve the quality of the generated complexes in the search space. Finally, the main steps of the proposed EA are outlined.

The proposed EA
Any Evolutionary Algorithm (EA) searches for appropriate solutions within the set of all possible solutions of the problem at hand. Generally, the search for good solutions is performed through individual evaluation, selection, crossover, and mutation operators. The design of such operators then determines the characteristics of the adopted EA. In this section, the definitions of all components are adapted to the complex detection problem in PPI networks.
First, the construction of several, but an unknown number of, complexes among the interacting proteins of a given PPI network is an important issue that the individual representation (i.e. chromosome genotype encoding) should take seriously. In the proposed EA, the locus-based representation used in [7] is adopted. A chromosome $I$ of the population is defined as a collection of node-neighbor genes. A single gene in the chromosome is defined by its locus and its allele. Consider a PPI network with $n$ proteins. Then, $I = (g_1, g_2, \dots, g_n)$ will consist of $n$ genes, where locus $i$ identifies protein $v_i$ in the network, while its allele value $g_i = j$ corresponds to a neighbor $v_j$ that has an actual interaction with node $v_i$ in the network, i.e. $A(i, j) = 1$. This in turn implies that both proteins $v_i$ and $v_j$ will be in the same complex. The decoding function of a chromosome (chromosome phenotype) outlines one of the possible partitionings of the network into $K$ complexes, i.e. $D(I) = \{C_1, C_2, \dots, C_K\}$. However, $K$ could vary from one chromosome to another.
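Decoding a locus-based chromosome amounts to finding the connected components of the graph formed by the links $i \to g_i$. A minimal Python sketch, assuming 0-based protein indices and a hypothetical helper named `decode` (neither is taken from the paper), could look like this:

```python
def decode(chromosome):
    """Decode a locus-based chromosome into complexes.

    chromosome[i] = j means protein i is linked to its neighbor j, so both
    end up in the same complex; the complexes are the connected components.
    """
    parent = list(range(len(chromosome)))

    def find(x):
        # Union-find root lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for i, j in enumerate(chromosome):
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[ri] = rj

    groups = {}
    for i in range(len(chromosome)):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())

# Hypothetical 6-protein chromosomes.
complexes_a = decode([1, 2, 0, 4, 5, 3])  # two complexes of three proteins each
complexes_b = decode([1, 0, 3, 2, 3, 4])  # one pair and one chain of four
```

Because the number of components depends only on the allele values, the number of complexes is not fixed in advance and can differ from one chromosome to another, as noted above.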
Once the population is created and its individuals are evaluated (according to the modularity in Eq. 1), a set of good parents is selected and processed by perturbation operators to create better child individuals. Two main perturbation operators are used: crossover and mutation. Uniform crossover is applied with a specified chromosome-wise crossover probability, $p_c$. Consider two chromosomes $I^a = (g^a_1, g^a_2, \dots, g^a_n)$ and $I^b = (g^b_1, g^b_2, \dots, g^b_n)$ to be the two participating parents in the crossover.
With probability $p_c$, a child $I^c = (g^c_1, g^c_2, \dots, g^c_n)$ can be generated from the two parents by uniformly mixing their alleles (i.e. performing a protein-wise fair combination). This can be formally defined by:

$$g^c_i = \begin{cases} g^a_i & \text{if } r_i < 0.5 \\ g^b_i & \text{otherwise} \end{cases} \qquad (2)$$

where $r_i \in [0, 1]$ is a uniform random number.
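A uniform crossover of this kind can be sketched as follows. The function name `uniform_crossover`, the fair 0.5 mixing threshold, and the choice to return a copy of the first parent when no crossover takes place are assumptions of this sketch rather than details confirmed by the paper.

```python
import random

def uniform_crossover(parent_a, parent_b, pc, rng):
    """With probability pc, mix the two parents gene by gene with equal chance.

    When no crossover takes place, a copy of the first parent is returned
    (an assumption of this sketch).
    """
    if rng.random() >= pc:
        return list(parent_a)
    return [a if rng.random() < 0.5 else b
            for a, b in zip(parent_a, parent_b)]

# Hypothetical parents over the same 6-protein network.
rng = random.Random(0)
parent_a = [1, 2, 0, 4, 5, 3]
parent_b = [2, 0, 1, 5, 3, 4]
child = uniform_crossover(parent_a, parent_b, 1.0, rng)
unchanged = uniform_crossover(parent_a, parent_b, 0.0, rng)
```

Since every allele of the child is copied from one of the parents at the same locus, each gene of the child still points to an actual neighbor, so the child always decodes to a valid partitioning.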
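The mutation operator imitates the traditional allelic mutation operator: it works on allele values and alters, with a specified mutation probability $p_m$, the allele (i.e. neighbor) of a selected locus (i.e. node). This can formally be specified by:

$$g'_i = \begin{cases} j & \text{if } r_i < p_m \\ g_i & \text{otherwise} \end{cases} \qquad (3)$$

where $j$ is the index of a randomly chosen neighbor of protein $i$ and $r_i \in [0, 1]$ is a uniform random number.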
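The neighbor-based mutation can be sketched in the same style. The function name `mutate`, the dictionary of neighbor sets, and the per-gene application of the mutation probability are assumptions made for illustration.

```python
import random

def mutate(chromosome, nbrs, pm, rng):
    """Replace each allele, with probability pm, by a random neighbor of its locus."""
    child = list(chromosome)
    for i in range(len(child)):
        if rng.random() < pm:
            # sorted() gives a reproducible order before the random pick
            child[i] = rng.choice(sorted(nbrs[i]))
    return child

# Hypothetical neighbor sets: two triangles joined by the edge (2, 3).
nbrs = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4}}
chromosome = [1, 2, 0, 4, 5, 3]
rng = random.Random(1)
same = mutate(chromosome, nbrs, 0.0, rng)
mutated = mutate(chromosome, nbrs, 1.0, rng)
```

Restricting the new allele to the neighbor list of its locus keeps the search inside the space of partitionings that respect the actual interactions of the network.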

Formulation of the heuristic optimization models and operator
Generally, a heuristic operator, or search heuristic, is defined as a rule that decides, given the current solution, which solution to generate or visit next based on some heuristic criterion. In the evolutionary computation community, designing an appropriate heuristic for a given problem is essential and can improve the performance of the algorithm. In the following discussion, we introduce a heuristic operator with three optimization models tailored specifically to the complex detection problem. Complexes in PPI networks are generally characterized by dense interactions within complexes and sparser interactions among different complexes. The main purpose of the proposed heuristic operator is to move proteins between the complexes of an individual solution. The movement of the selected proteins should reduce the problems of both sparse intra-connections and dense inter-ties. Thus, the proposed operator works, with a specified probability, on those nodes that are unreliably assigned to their current complexes and moves them to other complexes whose proteins they interact with more reliably. Here, we propose three different optimization models for defining the reliability of assigning a node to a given complex.
Consider an individual chromosome $I = (g_1, g_2, \dots, g_n)$ corresponding to a candidate partitioning solution $C = \{C_1, C_2, \dots, C_K\}$ with $K$ complexes. Let node $v_i$, which corresponds to gene $g_i$, be located in complex $C_k$, where $1 \le k \le K$. Then, node $v_i$ inside complex $C_k$ has a possible reliable interaction assignment that can be expressed as the difference between the impact of the intra-connections and the impact of the inter-connections:

$$R_k(v_i) = \Phi^{in}_k(v_i) - \Phi^{out}_k(v_i) \qquad (4)$$

For $\Phi^{in}_k$ and $\Phi^{out}_k$ in Eq. 4, three models are proposed to define them. Let us first consider an extended version of the adjacency matrix $A$ (discussed in the Related background section). A weighted adjacency matrix is constructed using Eq. 5.
The proposed heuristic operator (see Algorithm 1) then moves node $v_i$ to another complex in which the node would attain the highest reliability assignment, i.e. the highest difference between the intra-connection and inter-connection impact. The proposed heuristic operator can then be stated formally as in Eq. 5. Note that when more than one complex can receive node $v_i$ with an equal value, the operator randomly selects one of these complexes.
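To make the node-moving idea concrete, here is a minimal Python sketch of such an operator. It does not reproduce any of the paper's three models: as a stand-in, reliability is taken to be the plain count of a node's neighbors inside a complex minus the count of its neighbors outside it. The names `reliability` and `heuristic_move`, the probability parameter `ph`, and the toy graph are all hypothetical.

```python
import random
from itertools import combinations

def reliability(nbrs, label, v, k):
    """Assumed model: neighbors of v inside complex k minus neighbors outside k."""
    inside = sum(1 for u in nbrs[v] if label[u] == k)
    return inside - (len(nbrs[v]) - inside)

def heuristic_move(nbrs, label, ph, rng):
    """With probability ph per node, reassign it to its most reliable complex."""
    new_label = dict(label)
    for v in sorted(nbrs):
        if rng.random() >= ph:
            continue
        # Candidate complexes: the current one plus those of the neighbors.
        candidates = {new_label[v]} | {new_label[u] for u in nbrs[v]}
        scores = {k: reliability(nbrs, new_label, v, k) for k in candidates}
        best = max(scores.values())
        # Ties are broken at random, as described in the text.
        new_label[v] = rng.choice(sorted(k for k, s in scores.items() if s == best))
    return new_label

# Hypothetical toy graph: two 4-cliques joined by the edge (3, 4).
edges = list(combinations(range(4), 2)) + list(combinations(range(4, 8), 2)) + [(3, 4)]
nbrs = {v: set() for v in range(8)}
for u, v in edges:
    nbrs[u].add(v)
    nbrs[v].add(u)

# Node 3 starts in the wrong complex and should be moved back to complex 0.
start = {0: 0, 1: 0, 2: 0, 3: 1, 4: 1, 5: 1, 6: 1, 7: 1}
fixed = heuristic_move(nbrs, start, 1.0, random.Random(0))
```

In this toy case node 3 has three neighbors in complex 0 and only one in complex 1, so its reliability is highest in complex 0 and the operator moves it there, while every other node stays where it is.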

Results
The experiments include two commonly used PPI networks, denoted by and . has proteins with interactions, while has proteins and interactions. To compare the quality of the predicted complexes generated by the tested EA without the heuristic against those generated with the heuristic, two sets of golden standard complexes ( and 2) drawn from the Munich Information Center for Protein Sequence (MIPS) catalog [8] are used in the experiments. contains complexes, while 2 is made of hand-curated complexes. To evaluate the quality of the detected complexes obtained by the EA, several metrics are used. The predicted set of complexes $C = \{C_1, C_2, \dots, C_K\}$ obtained by the EA is compared with the golden standard set $G = \{G_1, G_2, \dots, G_M\}$ of $M$ complexes. A predicted complex $C_i$ in the solution overlaps a golden standard complex $G_j$ by an overlapping score $OS(C_i, G_j)$. Then, the predicted complex matches the golden standard complex if $OS(C_i, G_j)$ is equal to or larger than a specified threshold [9].

$$OS(C_i, G_j) = \frac{|C_i \cap G_j|^{2}}{|C_i| \, |G_j|}$$

where $|C_i \cap G_j|$ is the number of proteins common to both a predicted complex and a golden standard complex.
At both the complex and protein levels, the three standard metrics of recall, precision, and F-measure are evaluated. Recall is the fraction of golden complexes/proteins that are matched to any predicted complex. Precision, on the other hand, is the fraction of predicted complexes/proteins that are matched to any golden standard complex. The cumulative F-measure is the harmonic mean of recall and precision.
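The complex-level matching metrics can be sketched as below. The sketch assumes the widely used overlap score $|A \cap B|^2 / (|A|\,|B|)$; the threshold value 0.5 and the toy complexes are purely illustrative and are not taken from the experiments.

```python
def overlap_score(pred, gold):
    """Overlap score |A & B|^2 / (|A| * |B|) between two protein sets."""
    p, g = set(pred), set(gold)
    if not p or not g:
        return 0.0
    return len(p & g) ** 2 / (len(p) * len(g))

def complex_level_scores(predicted, gold, threshold):
    """Recall, precision and F-measure at the complex level."""
    def matched(c, others):
        return any(overlap_score(c, o) >= threshold for o in others)

    recall = sum(matched(g, predicted) for g in gold) / len(gold)
    precision = sum(matched(p, gold) for p in predicted) / len(predicted)
    if recall + precision == 0:
        return recall, precision, 0.0
    f_measure = 2 * recall * precision / (recall + precision)
    return recall, precision, f_measure

# Hypothetical predicted and golden standard complexes.
predicted = [{1, 2, 3}, {7, 8}]
gold = [{1, 2, 3, 4}, {5, 6}]
recall, precision, f_measure = complex_level_scores(predicted, gold, 0.5)
```

Here only the first predicted complex matches a golden standard complex (overlap score 9/12 = 0.75), so recall, precision and F-measure all equal 0.5.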
Other measures can be computed with no dependency on the overlapping score ($OS$). These are general sensitivity ($Sn$), general positive predictive value ($PPV$), and accuracy ($Acc$) [9]. Let $T_{ij}$ denote the number of proteins shared by reference complex $G_i$ and detected complex $C_j$. General sensitivity ($Sn$) between the set of reference complexes $\{G_1, \dots, G_M\}$ and the detected partitioning solution $\{C_1, \dots, C_K\}$ is the weighted average of the complex-wise sensitivity of all reference complexes (Eq. 21). General $PPV$ is defined similarly with respect to the detected complexes (Eq. 22).

$$Sn = \frac{\sum_{i=1}^{M} \max_{j} T_{ij}}{\sum_{i=1}^{M} |G_i|} \qquad (21)$$

$$PPV = \frac{\sum_{j=1}^{K} \max_{i} T_{ij}}{\sum_{j=1}^{K} T_{\cdot j}} \qquad (22)$$

where $T_{\cdot j}$ represents the marginal sum of column $j$. The tradeoff between $Sn$ and $PPV$ can be represented by their geometric mean, the accuracy $Acc$. A high accuracy value (Eq. 23) requires high performance in both $Sn$ and $PPV$.

$$Acc = \sqrt{Sn \times PPV} \qquad (23)$$

Results for all mentioned metrics are reported in Tables 1-7 and Figures 2 and 3. The reported results (given in bold) clearly reveal the positive impact of the proposed heuristic operator with the three different versions of its models. The proposed heuristic operator extends the applicability of the well-known modularity function ($Q$) to partitioning a given PPI network.

Table 6-Performance in terms of for and when $Q$ is adopted without heuristic, on the one hand, and with the proposed heuristics, on the other hand. Threshold of overlapping score is varied from to in step of .

Two additional metrics are also included in Table 7. These are the Cross Common Fraction ($CCF$) and the Strength of complex structure.
$CCF$ compares each pair of complexes, in which one comes from the golden data and the second comes from the detected result, to find the maximal shared parts.
Strength measures the intensity of the detected complexes. This measure comes after

Figure 2-Performance at the complex and protein levels for when $Q$ is adopted without heuristic, on the one hand, and with the proposed heuristics, on the other hand. Threshold of overlapping score is varied from to in step of .

Figure 3-Performance at the complex and protein levels for 2 when $Q$ is adopted without heuristic, on the one hand, and with the proposed heuristics, on the other hand. Threshold of overlapping score is varied from to in step of .
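The overlap-independent measures (general sensitivity, general positive predictive value, and their geometric mean) can be sketched from a contingency table of shared proteins. The sketch assumes the commonly used definitions of these measures; the function name `sn_ppv_acc` and the toy complexes are hypothetical.

```python
from math import sqrt

def sn_ppv_acc(gold, predicted):
    """General sensitivity, positive predictive value, and geometric accuracy.

    t[i][j] is the number of proteins shared by gold complex i and
    predicted complex j.
    """
    t = [[len(set(g) & set(p)) for p in predicted] for g in gold]
    sn_den = sum(len(g) for g in gold)
    sn = sum(max(row) for row in t) / sn_den if sn_den else 0.0

    col_sums = [sum(row[j] for row in t) for j in range(len(predicted))]
    col_max = [max(row[j] for row in t) for j in range(len(predicted))]
    ppv_den = sum(col_sums)
    ppv = sum(col_max) / ppv_den if ppv_den else 0.0

    return sn, ppv, sqrt(sn * ppv)

# Hypothetical reference and detected complexes.
gold = [{1, 2, 3, 4}, {5, 6}]
predicted = [{1, 2, 3}, {4, 5, 6}]
sn, ppv, acc = sn_ppv_acc(gold, predicted)
```

For these toy sets the contingency table is [[3, 1], [0, 2]], so both the sensitivity and the positive predictive value are 5/6, and so is their geometric mean.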