Topology Molecular Indices Relationship of Electronic Properties of N-Alkanes and Branched Alkanes

Eight electronic properties (HOMO energy, LUMO energy, HOMO-LUMO energy gap, point-charge dipole moment, hybrid dipole moment, molecular weight, heat of formation and zero-point energy) of 60 normal and branched alkanes were examined using topology molecular indices. All the electronic properties were calculated using semi-empirical self-consistent molecular orbital theory. The relationships between the calculated electronic properties and seven models of topology indices based on degree and/or distance were assessed by correlation, regression and principal component analysis (PCA). Most of the properties were well modelled (r² > 0.82) by the topology molecular indices, except the point-charge and hybrid dipole moments. PCA of 7 properties across the 60 alkane structures produced two principal components with eigenvalues greater than 1. The first principal component explained 60.388% and the second 26.457% of the data variation, giving a cumulative value of 86.845%.


Introduction
A topology index is a number that represents or describes some aspect of molecular structure. Topology indices based on graph theory are used in particular to explain the isomerism rationalised by chemical structure theory. The constitutional isomers of the members of certain homologous series, such as alkanes, can be represented by graphs called trees [1]. For 1, 2 and 3 carbon atoms there is one alkane constitutional isomer (tree) each. For 4, 5 and 6 carbon atoms the numbers of constitutional isomers are 2, 3 and 5, respectively. However, the number of isomers increases rapidly once the number of carbon atoms exceeds 7 [2]. Therefore, there is a need for numerical indices that capture the various invariants or characteristics of molecular isomers. Topology indices are also widely used to establish correlations between the structure of a molecular compound and its physicochemical properties or biological activities [3].
A graph is a discrete mathematical structure consisting of a collection of points (vertices) and lines connecting them (edges) [4]. The degree of a vertex is the number of edges incident to it. In chemical graph theory, vertices represent atoms and edges represent bonds. Wiener pioneered a topology index that can be easily visualised from the path distances in the hydrogen-suppressed graph [5]. Topological indices are divided into three types: degree-based, distance-based and spectrum-based indices [6]. The well-known and widely used degree-based and distance-based indices include the Wiener, Randic, Zagreb, Balaban, Schultz and Xu indices. Since 2000, over 400 topological indices have been used in structure-property or structure-activity studies [7]. However, topology indices have a major drawback of degeneracy, in which two or three chemical constitutions give rise to the same index value [8]. To date, several studies have sought to establish novel indices that meet the following criteria [9,10]: novel indices should have a direct structural interpretation (so that they can contribute meaningfully to building molecular models); they should capture structural features that existing indices have failed to accommodate adequately; and they should correlate strongly with some molecular properties. Many of these indices have been used to predict the physicochemical properties or biological activities of n-alkanes and branched alkanes [11].
The electronic structures of n-alkanes and branched alkanes have received extensive attention because these compounds are among the foundations of chemistry and petrochemistry. In petrochemistry, n-alkanes and branched alkanes are the major constituents of natural gas and crude oil, and are raw materials for the chemical processing and manufacturing industries [12]. Because of this applicability, many groups have investigated their electronic properties using experiments or computational quantum chemistry. Experimentally, the electronic properties of n-alkanes and branched alkanes have been investigated using vacuum ultraviolet spectroscopy or gas chromatography [13]. Computationally, they have been studied using methods such as time-dependent density functional theory (TD-DFT), symmetry-adapted cluster configuration interaction (SAC-CI), Møller-Plesset perturbation theory and semi-empirical calculations [14-16]. The main problem in computational quantum chemistry is to find the wavefunction solution within a reasonable time using the available computational resources [17]. This drawback can be overcome using a quantitative structure-property relationship approach. However, the choice of a suitable descriptor is still under investigation [18,19]. Therefore, this study aimed to investigate the relationship between molecular descriptors and the calculated electronic properties of the alkanes: HOMO energy, LUMO energy, HOMO-LUMO energy gap, point-charge dipole moment, hybrid dipole moment, molecular weight, heat of formation and zero-point energy. We also evaluated which molecular descriptors based on topology indices are suitable for modelling electronic properties. To this end, correlation analysis, principal component analysis and regression analysis were applied to the molecular descriptors (topology indices).

Calculation of Topology Indices
Topology indices are molecular descriptors derived from graph theory. The atoms in the molecular structure are represented by vertices, and the chemical bonds by edges [20,21]. The calculation of the topology indices is illustrated with a specific example, the molecule 2,2-dimethylbutane. This molecule can be represented by a graph G(V, E) with vertex set V and edge set E. Figure 1(a) shows that there are 6 vertices and 5 edges. Therefore, set V can be represented by {1, 2, 3, 4, 5, 6} [22], while set E can be represented by {(1, 2), (2, 3), (3, 4), (2, 5), (2, 6)}. Figure 1(b) shows the vertex degrees for 2,2-dimethylbutane. In this paper, the degrees of the two vertices i and j joined by an edge are denoted ui and uj, respectively. The topology indices are defined in equations (1)-(7). Example calculations for 2,2-dimethylbutane using all the topology indices can be found in the supplementary material.
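The graph representation described above can be sketched in a few lines of Python. This is only an illustration: the vertex and edge sets are those listed in the text, while the helper names `adjacency_list` and `vertex_degrees` are hypothetical and not part of the original work.

```python
from collections import defaultdict

# Vertex and edge sets of 2,2-dimethylbutane as listed in the text
# (hydrogen-suppressed graph; vertex labels follow the text).
V = [1, 2, 3, 4, 5, 6]
E = [(1, 2), (2, 3), (3, 4), (2, 5), (2, 6)]


def adjacency_list(edges):
    """Build an undirected adjacency list from a list of (i, j) edges."""
    adj = defaultdict(set)
    for i, j in edges:
        adj[i].add(j)
        adj[j].add(i)
    return adj


def vertex_degrees(vertices, edges):
    """Return {vertex: degree}, the number of edges incident to each vertex."""
    adj = adjacency_list(edges)
    return {v: len(adj[v]) for v in vertices}


degrees = vertex_degrees(V, E)
print(degrees)
```

Vertex 2, the quaternary carbon, has degree 4, and the degrees sum to twice the number of edges, which is a quick consistency check on any hand-built edge set.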
Topology I: Randic connectivity indices. The summation is taken over all possible paths, using the degree u of each vertex. The zeroth-order index, ⁰χ, is therefore the sum of the reciprocal square roots of the vertex degrees u. The sum connectivity is defined as in [24].

Topology II: Modified Zagreb indices. The hyper-Zagreb indices are defined as in [25]. In this paper, inverse hyper-Zagreb indices are also defined.

Topology III: Wiener indices. The Wiener indices are calculated from the distance matrix [26]. The Wiener index, w, is given by equation (3a). Randic also extended the Wiener index to the hyper-Wiener index [27]. The hyper-Wiener index, ww, is given by equation (3b) [28]:

$w = \sum_{i<j} d_{ij}$ (3a)

$ww = \frac{1}{2} \sum_{i<j} \left( d_{ij} + d_{ij}^{2} \right)$ (3b)

where dij is the shortest distance between vertices i and j. Wiener also defined a structural variable known as the polarity number, p. The polarity number is the number of unordered pairs of carbon atoms separated by three carbon-carbon bonds, that is, half the number of off-diagonal elements of the distance matrix that equal 3. Another parameter in the Wiener topology is Δw, which is given as:

$\Delta w = w_0 - w$ (3c)

where w0 is the Wiener index of the straight-chain member of the group of isomers and w is the Wiener index of the respective isomer.

Topology IV: Harary indices. The Harary index, h, of a graph G(V, E) is a reciprocal-distance index given by equation (4a) [29]:

$h = \sum_{i<j} \frac{1}{d_{ij}}$ (4a)

The diagonal elements of the Harary (reciprocal distance) matrix are zero. A second-order Harary index, hh, is also used.

Topology V: Balaban indices. The Balaban index is calculated using the average-distance sum connectivity, as defined in equation (5), where q is the number of vertex adjacencies and Di is the distance sum of vertex i in G(V, E) [30,31].

Topology VI: Molecular Topological Index (MTI index). The MTI index was introduced using the adjacency (A), degree (v) and distance (D) matrices [5]. An element of the adjacency matrix A of G(V, E) is 1 if vertices i and j are neighbours and zero otherwise. The matrix v contains the sum of each column of A and therefore represents the vertex degrees in matrix form. The MTI index uses algebraic matrix operations and simplifies to the following equation:

$MTI = \sum_{i} v_i^{2} + \sum_{i} v_i s_i$ (7)

where si is the distance sum of vertex i in G(V, E) and vi is the degree of vertex i. All the calculated topology indices are shown in Table 1.
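As a hedged illustration of the distance-based definitions, the sketch below evaluates several indices for the 2,2-dimethylbutane graph of Figure 1. It assumes the commonly cited forms of the Wiener, hyper-Wiener, Harary and MTI indices, which may differ in notation from the equations used in this paper. The function name `distances` and all variable names are illustrative.

```python
from collections import deque
from itertools import combinations

V = [1, 2, 3, 4, 5, 6]
E = [(1, 2), (2, 3), (3, 4), (2, 5), (2, 6)]


def distances(vertices, edges):
    """All-pairs shortest path lengths d[i][j] by breadth-first search."""
    adj = {v: set() for v in vertices}
    for i, j in edges:
        adj[i].add(j)
        adj[j].add(i)
    d = {}
    for src in vertices:
        dist = {src: 0}
        queue = deque([src])
        while queue:
            u = queue.popleft()
            for nb in adj[u]:
                if nb not in dist:
                    dist[nb] = dist[u] + 1
                    queue.append(nb)
        d[src] = dist
    return d


d = distances(V, E)
pairs = list(combinations(V, 2))  # unordered vertex pairs, i < j
degree = {v: sum(v in e for e in E) for v in V}

# Assumed standard forms of the distance-based indices.
wiener = sum(d[i][j] for i, j in pairs)
hyper_wiener = sum(d[i][j] + d[i][j] ** 2 for i, j in pairs) / 2
polarity = sum(1 for i, j in pairs if d[i][j] == 3)
harary = sum(1 / d[i][j] for i, j in pairs)

# Wiener index of the straight-chain isomer (n-hexane), needed for delta-w.
chain = [(k, k + 1) for k in range(1, 6)]
d0 = distances(V, chain)
wiener_chain = sum(d0[i][j] for i, j in pairs)
delta_w = wiener_chain - wiener

# MTI as the sum of v_i^2 plus the sum of v_i * s_i (s_i = distance sum).
dist_sum = {i: sum(d[i].values()) for i in V}
mti = sum(degree[i] ** 2 + degree[i] * dist_sum[i] for i in V)

print(wiener, hyper_wiener, polarity, harary, delta_w, mti)
```

For this graph the sketch gives a Wiener index of 28 and a polarity number of 3, and n-hexane gives a Wiener index of 35, so Δw = 7 under the definition in equation (3c).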

Calculation of electronic properties
The physical properties examined in this study (the observed values), namely HOMO energy, LUMO energy, HOMO-LUMO energy gap, point-charge dipole moment, hybrid dipole moment, molecular weight, heat of formation and zero-point energy, were calculated using semi-empirical self-consistent molecular orbital theory. These properties were computed using the MOPAC2016 software (version 21.002W, James J. P. Stewart). The calculations were done using PM6 parameters. The input structures and their geometry optimization were generated using the three-dimensional molecular editor Avogadro, version 1.2.0. The calculated physical properties of the molecular structures are shown in Table 2.

Correlation Analysis
Correlation analysis is used primarily to determine whether a relationship exists between variables. Correlation uses covariance, which measures the linear association between variables. The degree of correlation is measured quantitatively by the correlation coefficient rxy, also known as the product-moment or Pearson correlation coefficient. The value of rxy ranges between -1 and +1. If the slope of the linear relationship is negative, rxy is also negative. The relationship is strongest when the absolute value of rxy approaches 1.
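The Pearson coefficient described above can be computed directly from the covariance and the variances. The following is a minimal sketch; the data are hypothetical values chosen for illustration and are not taken from Table 2.

```python
import math


def pearson_r(x, y):
    """Pearson product-moment correlation: cov(x, y) / (sd(x) * sd(y))."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have equal length of at least 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical values, for illustration only (not data from Table 2).
chain_length = [1, 2, 3, 4, 5]
property_up = [2.0, 4.1, 5.9, 8.2, 9.8]
property_down = [9.8, 8.2, 5.9, 4.1, 2.0]

r_up = pearson_r(chain_length, property_up)      # close to +1
r_down = pearson_r(chain_length, property_down)  # close to -1
print(r_up, r_down)
```

The two hypothetical series differ only in the sign of their slope, so their coefficients have equal magnitude and opposite sign.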

Linear Regression
Linear regression is a predictive model that determines the best-fit straight line between the dependent and independent variables. The linear equation is established from the independent variables via the least-squares method. The linear regression equation is stated in equation 8:

$y = b_0 + b_1 x_1 + b_2 x_2 + \dots + b_n x_n$ (8)

where b0 is the intercept; b1, b2, …, bn are the coefficients of the n independent variables; x1, x2, …, xn are the n independent variables; and y is the predicted (calculated) value. We used the standard entry method in this calculation, in which all the independent variables are entered into the equation at the same time. Regression equations and other statistical measures were obtained using the SPSS software package. The final equations were selected based on their standard errors (SE) and F-values. SE estimates the spread of an estimated regression coefficient across repeated samples. The F-value was calculated by dividing the regression mean square by the error mean square.
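A least-squares fit of this form, together with R² and the F-value, can be sketched with NumPy as follows. This is not the SPSS procedure used in the study. The function name `fit_linear_regression` and the four data points are illustrative assumptions.

```python
import numpy as np


def fit_linear_regression(X, y):
    """Least-squares fit of y = b0 + b1*x1 + ... + bn*xn.

    Returns (coefficients [b0, b1, ...], R^2, F-value).
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    if X.ndim == 1:
        X = X[:, None]  # a single predictor becomes one column
    n_samples, n_predictors = X.shape
    design = np.column_stack([np.ones(n_samples), X])  # intercept column
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    y_hat = design @ coef
    ss_error = float(np.sum((y - y_hat) ** 2))
    ss_total = float(np.sum((y - y.mean()) ** 2))
    ss_regression = ss_total - ss_error
    r_squared = ss_regression / ss_total
    # F = regression mean square / error mean square
    ms_regression = ss_regression / n_predictors
    ms_error = ss_error / (n_samples - n_predictors - 1)
    return coef, r_squared, ms_regression / ms_error


# Hypothetical data with one predictor, for illustration only.
x = [0.0, 1.0, 2.0, 3.0]
y = [1.0, 3.0, 2.0, 4.0]
coef, r_squared, f_value = fit_linear_regression(x, y)
print(coef, r_squared, f_value)
```

Passing a two-dimensional array with one column per topology index gives the multiple-regression case. The number of samples must exceed the number of predictors plus one for the error mean square to be defined.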
A single cross-validation procedure was used to validate the regressions [33,34]. The effectiveness of the regression model was estimated by applying a model obtained from one data sample to another selected data sample. The number of samples must be larger than the number of predictors. The data were split into two subsamples of unequal size using random splitting. The correlation coefficient (r2) for a subsample of approximately 75% of all cases was computed, and the correlation coefficient (r1) for the other subgroup was also computed. The difference between the squares of these two values (ds² = r1² − r2²) was then estimated [34]. The estimated ds² indicates how well the regression model would predict a future sample.
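The split-sample procedure can be sketched as below for a single predictor. Several details are assumptions made only for illustration: the regression is fitted on the larger subsample, each correlation coefficient is taken between observed and predicted values, the labels r1 and r2 and the sign of ds² follow the wording of the text, and the function name and seed are hypothetical.

```python
import numpy as np


def split_sample_validation(x, y, fit_fraction=0.75, seed=0):
    """Random split validation for a one-predictor linear regression.

    Returns (r1, r2, ds2), where r2 belongs to the ~75% subsample,
    r1 to the remaining cases, and ds2 = r1**2 - r2**2 (assumed convention).
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(x))
    n_fit = int(round(fit_fraction * len(x)))
    fit_idx, other_idx = order[:n_fit], order[n_fit:]

    # Regression equation obtained from the larger subsample only.
    slope, intercept = np.polyfit(x[fit_idx], y[fit_idx], 1)

    def r_observed_vs_predicted(idx):
        predicted = intercept + slope * x[idx]
        return np.corrcoef(y[idx], predicted)[0, 1]

    r2 = r_observed_vs_predicted(fit_idx)    # ~75% subsample
    r1 = r_observed_vs_predicted(other_idx)  # remaining subsample
    return r1, r2, r1 ** 2 - r2 ** 2


# Hypothetical, exactly linear data: both subsamples fit perfectly.
x = np.arange(12)
y = 2.0 * x + 1.0
r1, r2, ds2 = split_sample_validation(x, y)
print(r1, r2, ds2)
```

With exactly linear data both coefficients equal 1 and ds² is zero. With real data, a value far from zero signals that the equation does not transfer well between subsamples.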

Principal Component Analysis
Principal component analysis (PCA) was used to determine the inherent dimensionality of the group of properties. It is a data compression method based on the correlations among variables. It groups correlated variables, replacing the original descriptors with a new set, called principal components (PCs), onto which the data are projected. These PCs are uncorrelated and are built as linear combinations of the original variables. The first few PCs contain most of the variability in the data set, albeit in a much lower-dimensional space. The first principal component, PC1, lies in the direction of maximum variance of the whole data set. PC2 is the direction of maximum variance in the subspace orthogonal to PC1. Subsequent components are taken orthogonal to those previously chosen and describe the maximum of the remaining variance. Once the redundancy is removed, only the first few principal components are required to describe most of the information contained in the original data set. The PROMAX rotation was used during the calculation.
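The extraction step of PCA can be sketched as an eigen-decomposition of the correlation matrix, retaining components with eigenvalues greater than 1. The sketch below omits the rotation step and uses synthetic data constructed so that the result is easy to check. The function name `pca_from_correlation` is illustrative, and the data are not from this study.

```python
import numpy as np


def pca_from_correlation(data):
    """PCA on the correlation matrix (rows = molecules, columns = properties).

    Returns (eigenvalues, explained variance in %, number of retained
    components, unrotated loadings of the retained components).
    """
    corr = np.corrcoef(np.asarray(data, dtype=float), rowvar=False)
    eigenvalues, eigenvectors = np.linalg.eigh(corr)  # ascending order
    order = np.argsort(eigenvalues)[::-1]             # largest first
    eigenvalues = eigenvalues[order]
    eigenvectors = eigenvectors[:, order]
    explained = 100.0 * eigenvalues / eigenvalues.sum()
    n_retained = int(np.sum(eigenvalues > 1.0))       # eigenvalue > 1 rule
    # Unrotated loadings: correlation of each variable with each component.
    loadings = eigenvectors * np.sqrt(np.clip(eigenvalues, 0.0, None))
    return eigenvalues, explained, n_retained, loadings[:, :n_retained]


# Synthetic data: columns 1-2 are perfectly correlated, columns 3-4 are
# perfectly anti-correlated, and the two groups are uncorrelated.
x = np.array([1.0, 2.0, 3.0, 4.0])
z = np.array([1.0, -1.0, -1.0, 1.0])
data = np.column_stack([x, 2.0 * x, z, -z])

eigenvalues, explained, n_retained, loadings = pca_from_correlation(data)
print(eigenvalues, explained, n_retained)
```

Here four variables collapse into two components, each explaining half of the variance, which mirrors how groups of correlated properties are summarised by a small number of PCs.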

Correlation analysis
The correlations among the electronic properties of the alkanes (HOMO energy, LUMO energy, HOMO-LUMO energy gap, point-charge dipole moment, hybrid dipole moment, molecular weight, heat of formation and zero-point energy) are shown in Table 3. The results showed that most of the properties were moderately correlated. HOMO and LUMO had a moderate correlation with each other (0.55147). The HOMO-LUMO energy gap (ΔE) had a strong correlation with HOMO, and a correlation of more than 0.8 with MW and ZPE. As seen in Table 3, molecular weight (MW) had a strong correlation with the heat of formation (HF) and zero-point energy (ZPE). The correlation between the hybrid and point-charge dipole moments was high, at 0.9257, indicating that the two are closely related. However, the hybrid and point-charge dipole moments had low correlations with the other properties.

Regression analysis
In the regression analysis, the electronic properties (HOMO, LUMO, ΔE, point-charge dipole moment, hybrid dipole moment, MW, HF and ZPE) are taken to carry information on the shape of a molecule, which is represented by its molecular topology. The multiple linear regression (MLR) model for a property P of a molecule is given by:

$P = a + \sum_{i} b_i \, TI_i$

where TIi are the topology indices, and a and bi are the constant and the coefficients of the contributions of molecular topology. Linear regression was formulated for the Xu and MTI indices. The regressions were assessed by the coefficient of determination (R²), SE and F-values, which are shown in Tables 4-11.

Frontier molecular orbital energies
The relations of the molecular topology indices with the energies of the highest occupied and lowest unoccupied molecular orbitals (HOMO and LUMO), calculated by the PM6 quantum chemistry method, are given in Tables 4 and 5, while the relation of the topology indices with ΔE is given in Table 6. Harary indices showed the highest R² value for HOMO, LUMO and ΔE, which was 0.991. They were followed by the Randic and Zagreb indices, which showed similar patterns for all the frontier molecular orbital energies. The R² values of the HOMO energy for the Balaban, Xu and Wiener indices were 0.965, 0.950 and 0.914, respectively. For the LUMO energy, the corresponding R² values were 0.968, 0.948 and 0.916. According to Table 6, the R² values of ΔE were 0.966 for the Balaban index, 0.949 for the Xu index and 0.914 for the Wiener index. All the topology indices gave R² values of more than 0.9 for the frontier molecular orbital energies except the MTI index, for which R² was 0.826, 0.824 and 0.826 for HOMO, LUMO and ΔE, respectively.
Alkanes are saturated aliphatic hydrocarbons composed entirely of sigma (σ) bonds. The HOMO and LUMO energies depend on the molecular structure through the molecular orbital band [35]. The electronic transitions occur at wavelengths below 200 nm. The high R² value for the Harary indices might arise because these indices use the inverse distance matrix in the calculation, which plays a role in the frontier molecular orbital energies. Nasiri reported that ΔE was inversely proportional to the length of the n-alkane chain [36]. The HOMO-LUMO energy gap decreases as the number of atoms increases, as proposed by Morisawa and co-workers [15]. They proposed that adding atoms to the molecular structure increases the number of electrons occupying the orbitals. This increase raises the HOMO energy level, an effect known as orbital "destabilization". For the HOMO-LUMO gap to decrease, the LUMO energy must then either decrease or remain the same. Increased branching in the alkane structure changes the Rydberg state, which varies with the distance between the nuclei [37]. The inverse distance matrix (Harary indices) gave the highest R² value. These indices reflect both the structure of the molecule and the number of carbon atoms, which makes them comparable to the PM6 molecular calculation. Randic connectivity indices also showed good regression values. These indices are calculated from the inverse degrees of the carbon-atom vertices in the structure, showing that they can represent both the branching of the molecules and the number of carbon atoms. The higher-order Randic indices follow paths in the tree; by analogy, a C-C bond influences its first, second and third neighbouring C-C bonds, and so on.
The lowest R² value, obtained for the MTI indices, plausibly reflects that these indices do not fully represent molecular branching and the reduction of the HOMO and LUMO energies as the number of carbon atoms increases.

Point charge dipole moment
In semi-empirical calculations, the total molecular dipole moment is given by [38]:

$\mu_{total} = \mu_{Q} + \mu_{hyb}$

where μQ is the dipole moment contributed by the net point charges located at the nuclear positions, and μhyb is the contribution from atomic polarization resulting from the hybrid atomic orbitals. Table 7 shows the multiple linear regression equations for μQ. All the topology indices gave R² values of less than 0.8. The highest R² was 0.778 for the Randic indices, followed by the Wiener, Harary, Zagreb, Balaban and Xu indices, as shown in Table 7. MTI indices showed the lowest value, R² = 0.663. The point-charge dipole moment is related to the net charge distribution and the position vectors of the atoms. It also accounts for the electrostatic interaction between the charges in the molecule [38]. The low R² values for this dipole moment contribution arise because the topology indices were not calculated from a directed graph [39]. The topology indices do not capture the charge distribution or the atomic hybrid orbitals. Even though the Wiener indices include the polarity number p of the tree graph, their R² value was still low. In the tree-graph approach, the number of unordered pairs of carbon atoms separated by three carbon-carbon bonds contributes to the molecular dipole moment.

Hybrid dipole moment
The R² values for the hybrid dipole moment contribution were higher than those for the point-charge dipole moment. Randic indices showed the highest R² value, 0.804, followed by the Harary, Xu, Zagreb, Wiener and Balaban indices, as shown in Table 8. The higher R² values for the hybrid dipole moment compared with the point-charge dipole moment are plausibly due to the σ-character of alkane molecules, which results from orbital hybridization in the molecular structure. The lowest R² value, 0.684, was obtained for the MTI index. We believe that the low R² values of the MTI index for both the hybrid and point-charge dipole moments might arise because non-isomorphic trees of the same size may have the same MTI value [40].

Molecular Weight
According to the regression analyses, all the topology indices showed a good relationship with molecular weight, with R² values of more than 0.91, as shown in Table 9. This indicates that the topology indices are versatile in representing molecular structure. Gumus and Turker also reported that their TG index correlated with the molecular weight of alkanes and alkenes [41]. The main objective of topology indices is to represent the molecular structure of isomers with a numerical value. Many organic compounds have the same atomic composition and molecular weight but different structural or stereochemical formulas, and hence different physical or chemical properties. Molecular weight alone is therefore not sufficient for quantitative structure-activity or structure-property relationship (QSAR/QSPR) applications, because of isomerism. The high R² values show that all the topology indices are capable of representing the molecular structure as molecular descriptors for QSAR/QSPR applications.

Heat of formation
The heat of formation is the energy change on forming one mole of an alkane compound in the gas phase from its elements in their standard states at standard temperature and pressure. The heats of formation of the alkanes, calculated using the semi-empirical PM6 method, showed a strong correlation with molecular structure. The results (Table 10) showed that the molecular topology indices correlated strongly with the heat of formation, with R² values of more than 0.98. Bond energy plays an important role in predicting the heat of formation and thermochemical properties, and bond order also contributes to the value of the heat of formation. The Randic index is based on the connectivity of the carbon atoms in the molecular structure. Zagreb indices are based on the vertex degrees. The Wiener, Harary and Balaban indices are based on the distances between vertices and the positions of atoms, while the MTI and Xu indices are based on both vertex distances and vertex degrees. Randic indices showed a high value of R² = 0.999. The atom-bond connectivity index has also been reported to agree with heats of formation calculated by ab initio and DFT (MP2, B3LYP) quantum chemical methods [42].

Zero-Point Energy
All the molecular topology indices had R² values of more than 0.91 for zero-point energy, as shown in Table 11. These topology indices reflect the effects of molecular branching and give a high-quality structure-property relationship. The zero-point energy (ZPE) of a molecule is related to the frequencies of its normal vibrational modes. Schulman and Disch reported that the zero-point energy of hydrocarbons with molecular stoichiometry CnHm can be estimated from the empirical relation [43]:

$ZPE = 3.8n + 7.12m - 6.19$ (11)

where n and m are the numbers of carbon and hydrogen atoms in the molecule. Rahal et al. reported that the empirical ZPE relationship must be adjusted for the number of atoms and for branching [22]. The high R² values in this calculation show that this approach may provide an alternative to the empirical relation, based on the locations of the branches in the molecular structure. Thus, ZPE reflects the structural characteristics of bonding, shape and the increment in the number of atoms, which is related to molecular weight.
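Equation (11) is simple enough to evaluate directly, as the short sketch below illustrates. The coefficients are exactly those quoted in the text. The units are not stated in this section, so the result is left unlabelled, and the function names are illustrative.

```python
def zpe_empirical(n_carbon, n_hydrogen):
    """Empirical ZPE of CnHm from equation (11): 3.8*n + 7.12*m - 6.19."""
    return 3.8 * n_carbon + 7.12 * n_hydrogen - 6.19


def alkane_hydrogens(n_carbon):
    """Hydrogen count of an acyclic alkane, CnH(2n+2)."""
    return 2 * n_carbon + 2


n = 6  # any C6H14 isomer, for example 2,2-dimethylbutane or n-hexane
zpe_c6 = zpe_empirical(n, alkane_hydrogens(n))
print(zpe_c6)
```

Because the relation depends only on n and m, every C6H14 isomer receives the same estimate, which is why a descriptor sensitive to branching can add information beyond the empirical formula.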

Validation
Cross-validation is a statistical method for investigating the predictive validity of a linear regression equation. Table 12 shows the values of ds² for all the electronic properties. In this method, a stable model has a ds² value that is positive and close to zero [34,44]. For HOMO energy and molecular weight, all the topology descriptors showed positive values close to zero. For LUMO energy, the Randic, Balaban and Xu indices showed negative values of ds². For the HOMO-LUMO energy gap (ΔE), MTI and Xu showed negative values. All the molecular topology descriptors showed positive values for the point-charge dipole moment except MTI. Randic, Zagreb and Harary showed relatively large values of ds². Meanwhile, the hybrid dipole moment exhibited positive values for all descriptors. Two descriptors, Randic and MTI, had rather large values of ds². Three descriptors (Zagreb, MTI and Xu) showed negative values of ds². Finally, all descriptors showed positive values for zero-point energy except the Harary indices.

Principal Component Analysis
Correlation analyses were performed between the electronic properties of the alkanes: HOMO, LUMO, HOMO-LUMO energy gap (ΔE), point-charge and hybrid molecular dipole moments, molecular weight, heat of formation and zero-point energy. The correlations of the examined electronic properties are shown in Table 3. The PCA results are shown in Figure 2. In the analysis of 7 properties and 60 alkane structures, as indicated in Table 13, two principal components with eigenvalues greater than 1 were produced. The first principal component explained 60.388% and the second 26.457% of the data variation, giving a cumulative value of 86.845%. From Table 13, the properties can be classified into three main groups. The first comprises the thermodynamic properties: molecular weight (MW) had a positive correlation with heat of formation (HF) but a negative correlation with zero-point energy (ZPE). The second comprises the molecular electronic properties, HOMO and LUMO. Table 14 shows that HOMO also has a negative correlation with LUMO. This could be because HOMO and LUMO are the frontier orbitals: the HOMO is the highest-energy orbital that contains electrons, and the LUMO is the lowest-energy orbital that contains none. The last group comprises the atomic electronic properties, for which this study focused on the point-charge dipole moment and the hybrid dipole moment. These two properties showed strong correlations with each other. The HOMO-LUMO energy gap (ΔE) showed a negative correlation with HOMO and a strong correlation with LUMO. However, PCA could not be conducted with ΔE included, as it would make the correlation matrix non-positive definite (NPD). The matrix becomes NPD because ΔE is linearly dependent on the HOMO and LUMO variables.
Table 13 shows the component matrix of the correlations between the seven properties and the components. The results indicate that all the properties except the point-charge and hybrid dipole moments load strongly on PC1. However, only the point-charge and hybrid dipole moments are significant on PC2.
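The linear dependency that prevents ΔE from entering the PCA can be demonstrated numerically. The sketch below assumes the usual definition ΔE = E(LUMO) − E(HOMO) and uses hypothetical orbital energies, not values from this study. It checks that the correlation matrix of the three variables has a zero eigenvalue.

```python
import numpy as np

# Hypothetical orbital energies (eV) for five molecules, illustration only.
homo = np.array([-11.0, -10.8, -10.9, -10.5, -10.7])
lumo = np.array([3.5, 3.1, 3.4, 3.0, 3.3])
gap = lumo - homo  # assumed definition of the HOMO-LUMO gap

# Rows are variables, so np.corrcoef returns a 3 x 3 correlation matrix.
corr = np.corrcoef(np.vstack([homo, lumo, gap]))
eigenvalues = np.linalg.eigvalsh(corr)  # ascending order
smallest = eigenvalues[0]

# A smallest eigenvalue of (numerically) zero means the matrix is singular,
# hence not positive definite.
print(eigenvalues)
```

Dropping any one of the three variables removes the dependency, which is consistent with running the PCA on seven properties rather than eight.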