Enhanced Supervised Principal Component Analysis for Cancer Classification

Abstract In this paper, a new hybridization of supervised principal component analysis (SPCA) and stochastic gradient descent, called SGD-SPCA, is proposed for real large datasets that have a small number of samples in a high-dimensional space. SGD-SPCA is intended to become an important tool for diagnosing and treating cancer accurately. When we have large datasets that require many parameters, SGD-SPCA is an efficient method, and it can easily update the parameters when a new observation arrives. Two cancer datasets are used: the first is for leukemia and the second is for small round blue cell tumors (SRBCT). In addition, simulated datasets are used to compare principal component analysis (PCA), SPCA, and SGD-SPCA. The results show that SGD-SPCA is more efficient than the other existing methods.

A matrix $U$ can be constructed using a set of principal directions, so an observation $x$ can be projected onto the column space of $U$, $\mathcal{C}(U)$. The projection of $x$ onto $\mathcal{C}(U)$ can be seen as a linear system of equations, i.e.,

$x = U\beta$   (1)

where $\hat{x} = U\beta$ and $\beta$ are the unknown quantities. Eq. 1 is a linear system, and it has an exact solution if $x$ lies in the column space of $U$, i.e., $x \in \mathcal{C}(U)$. Otherwise, there is no exact solution for Eq. 1, and it should be solved by projecting $x$ onto $\mathcal{C}(U)$ and then reconstructing it. Let us define the difference between $x$ and $\hat{x}$ to be the residual $e = x - \hat{x}$. The norm of $e$ needs to be small, which is attained when $e$ is orthogonal to $\mathcal{C}(U)$. Hence

$U^{\top}(x - U\beta) = 0.$   (2)

Since $U$ is a set of orthonormal vectors, $U^{\top}U = I$, so from Eq. 1 and 2:

$\beta = U^{\top}x, \qquad \hat{x} = U U^{\top} x.$   (3)

For the projected data points $\{\hat{x}_i\}$, the variance of the projection onto the PCA subspace along a direction $u$ is $u^{\top} S u$, where $S$ is the covariance matrix of the data. Our goal is to find a projection direction that maximizes the variance of the projection (the squared length of the reconstruction). The reconstruction error is a goodness-of-fit measure of how well the lower-dimensional space represents the original variables; it should be small, so that a large proportion of the total variation in the data is explained by the first principal component [7].
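As a minimal numerical sketch of the projection and reconstruction in Eqs. 1-3 (using NumPy; the variable names and the synthetic data are illustrative, not taken from the paper):

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))          # 100 observations, 10 variables
X = X - X.mean(axis=0)                  # center the data

S = np.cov(X, rowvar=False)             # covariance matrix S
eigvals, eigvecs = np.linalg.eigh(S)    # eigen-decomposition of S
order = np.argsort(eigvals)[::-1]       # sort directions by explained variance
U = eigvecs[:, order[:2]]               # top-2 principal directions (orthonormal)

beta = X @ U                            # Eq. 3: beta = U^T x for every observation
X_hat = beta @ U.T                      # reconstruction x_hat = U U^T x
residual = X - X_hat                    # e = x - x_hat

print("variance captured:", eigvals[order[:2]].sum() / eigvals.sum())
print("max |U^T e|:", np.abs(residual @ U).max())   # ~0: residual is orthogonal to C(U)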

Supervised Principal Component Analysis
As discussed in the previous section, PCA finds the directions of maximum variation in the original high-dimensional space; this can be used as a dimensionality-reduction and pre-processing step for classification. PCA is an unsupervised method, unlike Fisher Discriminant Analysis (FDA) [7]; SPCA, on the other hand, is a generalization of PCA. SPCA has some advantages over FDA, and it can use label information for classification tasks. The sequence of principal components that have the maximum dependency on the response variable can be estimated using SPCA [3].
Suppose that we have data $\{(x_i, y_i)\}_{i=1}^{n}$, where $x_i \in \mathbb{R}^{p}$ and $y_i$ is the response. The response is not restricted to binary classes, so it is not required that $y$ takes only discrete values, and hence the model can be used for regression as well. In regular PCA, a lower-dimensional subspace is sought in which $X$ is the covariate matrix and $U$ is an orthogonal projection matrix. In SPCA, however, the projection matrix should be determined such that $P(Y \mid X) = P(Y \mid U^{\top}X)$, which means the subspace contains approximately the same predictive information as the original covariates. A dependence between the original covariates and the response must exist; if $X$ and $Y$ are entirely independent, regression or classification cannot be performed.

Using the standardized univariate regression coefficient $s_j = x_j^{\top} y \,/\, \sqrt{x_j^{\top} x_j}$ for each feature $x_j$, the steps of SPCA can be carried out as follows. First, the standardized regression coefficient $s_j$ is computed for each feature. Then the data matrix is reduced to the columns for which $|s_j| > \theta$, where the threshold $\theta$ can be found by cross-validation [3]. Finally, for the reduced data matrix, the first principal component is computed, which can be used in a regression model or a classification algorithm to produce the outcome. SPCA is consistent, unlike standard PCA; PCA can take different directions for the leading component as the number of data points increases [15]. The modified SPCA can be derived using the Hilbert-Schmidt independence criterion discussed below.
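A minimal sketch of this screening procedure, assuming a centered data matrix X, a response vector y, and an illustrative fixed threshold (in practice the threshold would be chosen by cross-validation):

import numpy as np

def spca_screen(X, y, threshold):
    """Keep the columns of X whose standardized univariate coefficient exceeds the threshold."""
    scores = np.abs(X.T @ y) / np.sqrt(np.sum(X ** 2, axis=0))   # |s_j| for every feature
    keep = scores > threshold
    X_reduced = X[:, keep]
    # first principal component of the reduced matrix via SVD
    X_c = X_reduced - X_reduced.mean(axis=0)
    _, _, Vt = np.linalg.svd(X_c, full_matrices=False)
    pc1 = X_c @ Vt[0]
    return keep, pc1

rng = np.random.default_rng(1)
X = rng.normal(size=(60, 500))
X -= X.mean(axis=0)
y = 2.0 * X[:, 0] + 2.0 * X[:, 1] + rng.normal(scale=0.5, size=60)   # only two informative genes
keep, pc1 = spca_screen(X, y - y.mean(), threshold=8.0)
print("features kept:", keep.sum())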

Kernel Supervised PCA
The linear projection of PCA might not be effective when the data points lie on a nonlinear manifold. Two options can be applied to handle this problem. First, PCA itself can be changed into a nonlinear method. Second, the data points can be transformed so that they fall on, or close to, a linear subspace. The second solution can be achieved by mapping the data points to a space of higher dimensionality, hoping that they fall on a linear manifold there. To find a linear transformation $U$ such that $U^{\top}X$ has maximum dependence on $Y$, a linear kernel on $U^{\top}X$ and a kernel over $Y$ (call it $K$) can be used. We attempt to find the $U$ that maximizes

$\operatorname{tr}\!\left(U^{\top} X H K H X^{\top} U\right)$ subject to $U^{\top}U = I$,

where $H = I - \frac{1}{n}\mathbf{1}\mathbf{1}^{\top}$ is the centering matrix, and the columns of the optimal $U$ are the top eigenvectors of $Q = X H K H X^{\top}$. Notice that if $K = I$ is chosen, then $X H H X^{\top} = (X - \bar{X})(X - \bar{X})^{\top}$, which is (up to scaling) the covariance matrix of $X$, and hence PCA is a special case of SPCA [16]. Possible kernel functions include the linear kernel, the polynomial kernel, the Gaussian kernel, and the delta kernel. The necessary steps of SPCA are therefore to center the kernel, form $Q$, and take its top eigenvectors, as sketched below.
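A minimal sketch of these steps, assuming observations are stored in the rows of X (so that $Q$ becomes $X^{\top} H K H X$) and a delta kernel over class labels; variable names are illustrative:

import numpy as np

def supervised_pca(X, y, n_components=2):
    """Top eigenvectors of Q = X^T H K H X (observations in rows)."""
    n = X.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n                 # centering matrix
    K = (y[:, None] == y[None, :]).astype(float)        # delta kernel over the labels
    Q = X.T @ H @ K @ H @ X
    eigvals, eigvecs = np.linalg.eigh(Q)
    U = eigvecs[:, np.argsort(eigvals)[::-1][:n_components]]
    return U                                            # project the data with X @ U

rng = np.random.default_rng(2)
y = np.repeat([0, 1], 30)
X = rng.normal(size=(60, 50)) + 2.0 * y[:, None] * rng.normal(size=(1, 50))
U = supervised_pca(X, y, n_components=2)
Z = X @ U                                               # supervised low-dimensional representation
print(Z.shape)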

Hilbert-Schmidt Independence Criterion (HSIC)
HSIC was introduced by Gretton et al. [17]. It is an independence criterion that measures the dependence between two variables $X$ and $Y$, and Barshan et al. used it to derive the supervised PCA technique [3]. Let $\bar{X}$ be the matrix in which every entry of row $i$ is the mean of row $i$ of $X$, so that $X - \bar{X}$ is the centered data matrix. The idea is to select the features that maximize the dependence between the two distributions [12]. Measuring the dependence between two distributions can be performed using different techniques. In general, two distributions are different if their means are different, but if the two means are the same, then the second moment of these distributions needs to be checked. Now, by calculating the difference between $\mathbb{E}[\phi(x)]$ and $\mathbb{E}[\phi(y)]$ for a feature map $\phi$, we can find out whether the two random variables $x$ and $y$ have the same distribution or not, i.e., $x$ and $y$ have the same distribution if $\left\| \mathbb{E}[\phi(x)] - \mathbb{E}[\phi(y)] \right\|$ is equal to $0$.
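An empirical HSIC value can be computed from centered kernel matrices. The following is a minimal sketch using the biased estimator $\mathrm{HSIC} = \frac{1}{(n-1)^2}\operatorname{tr}(K H L H)$, with an illustrative Gaussian kernel on the variable and a delta kernel on the labels:

import numpy as np

def hsic(K, L):
    """Biased empirical HSIC between two kernel matrices K and L."""
    n = K.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n          # centering matrix
    return np.trace(K @ H @ L @ H) / (n - 1) ** 2

def gaussian_kernel(x, sigma=1.0):
    d = x[:, None] - x[None, :]
    return np.exp(-(d ** 2) / (2 * sigma ** 2))

rng = np.random.default_rng(3)
y = np.repeat([0.0, 1.0], 50)
x_dep = y + 0.1 * rng.normal(size=100)           # variable dependent on y
x_ind = rng.normal(size=100)                     # variable independent of y

L = (y[:, None] == y[None, :]).astype(float)     # delta kernel on the labels
print("dependent:  ", hsic(gaussian_kernel(x_dep), L))
print("independent:", hsic(gaussian_kernel(x_ind), L))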

Stochastic Gradient Descent SPCA (SGD-SPCA)
The goal is to find the directions in which the variance of the projected data, $\operatorname{var}(u^{\top}x)$, is maximum. Let $u$ denote the unit-vector direction along which the variance is maximum. The variance along this direction is $u^{\top} S u$, subject to $u^{\top}u = 1$, where $S$ is the covariance matrix. Introducing a Lagrange multiplier $\lambda$, evaluate the gradient with respect to $u$ of

$L(u, \lambda) = u^{\top} S u - \lambda\,(u^{\top}u - 1),$

which gives $\nabla_{u} L = 2Su - 2\lambda u$, and set it to zero to find the optimum values, i.e., $Su = \lambda u$.
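The stationarity condition $Su = \lambda u$ says that the optimal direction is an eigenvector of $S$. A quick numerical check (a hypothetical power-iteration sketch, not part of the paper's algorithm) confirms that the dominant eigenvector maximizes $u^{\top}Su$:

import numpy as np

rng = np.random.default_rng(4)
X = rng.normal(size=(200, 5)) @ np.diag([3.0, 2.0, 1.0, 0.5, 0.1])
S = np.cov(X, rowvar=False)

u = rng.normal(size=5)
for _ in range(100):                 # power iteration: converges to the top eigenvector of S
    u = S @ u
    u /= np.linalg.norm(u)

lam = u @ S @ u                      # Rayleigh quotient = variance along u
print("power iteration:", lam)
print("eigh maximum:   ", np.linalg.eigvalsh(S).max())   # the two values should agree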
We solve the optimization problem whose objective function is given in the following equation:

$\min_{U,\, w}\; L(Y, XUw) + \lambda\, \|X - XUU^{\top}\|_F^{2}$   (10)

where $L(\cdot,\cdot)$ is a loss function, $X$ is the data matrix, $Y$ contains the dependent variables, $U$ is the basis for the learned subspace, $w$ is the learned coefficient vector for prediction, and $\lambda$ is a trade-off parameter. If we consider the case where $L$ is the squared (Frobenius) error loss, then Eq. 10 becomes

$\min_{U,\, w}\; \|Y - XUw\|_F^{2} + \lambda\, \|X - XUU^{\top}\|_F^{2}.$   (11)

The derivative of this objective involves $P_{U^{\perp}} = I - UU^{\top}$, the projection matrix onto the orthogonal complement of the span of $U$. The retraction is the key notion for applying the SGD-based optimization method to manifold optimization [18]: a step is taken in the direction of the negative gradient and then retracted back onto the manifold. If a closed-form expression for moving between two points on the manifold is available, the update can be taken directly along it, which is called a geodesic step [19]. Edelman et al. gave an expression for the geodesic step: if the descent direction has the compact singular value decomposition $W \Sigma V^{\top}$, then

$U(t) = U V \cos(\Sigma t)\, V^{\top} + W \sin(\Sigma t)\, V^{\top},$

where the step size $t$ is chosen by an Armijo backtracking line search [20]. The process of SGD-SPCA at each iteration can be summarized as follows:
i. At the current iteration, calculate the Euclidean gradient.
ii. Obtain the Riemannian gradient by projecting the negative Euclidean gradient onto the tangent space.
iii. Compute the singular value decomposition of the resulting matrix.
iv. Update $U$ by taking a geodesic step with an Armijo line search.
Algorithm 2 (SGD-SPCA) takes as input initializations of $U$ and $w$ generated by PCA and outputs the reduced dataset; a minimal sketch of one iteration is given below.
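The following is an illustrative sketch of one iteration for the squared-loss case of Eq. 11. It is not the paper's exact algorithm: a small fixed step size stands in for the Armijo line search, and the coefficient vector w is simply refit by least squares after each geodesic step.

import numpy as np

def objective(X, y, U, w, lam):
    return np.sum((y - X @ U @ w) ** 2) + lam * np.sum((X - X @ U @ U.T) ** 2)

def sgd_spca_step(X, y, U, w, lam, t=1e-4):
    """One geodesic descent step on U for the objective in Eq. 11 (illustrative only)."""
    r = X @ U @ w - y                                     # prediction residual
    R = X - X @ U @ U.T                                   # reconstruction residual
    # Euclidean gradient of Eq. 11 with respect to U (squared-loss case)
    G = 2 * X.T @ np.outer(r, w) - 2 * lam * (X.T @ R @ U + R.T @ X @ U)
    D = -(np.eye(U.shape[0]) - U @ U.T) @ G               # Riemannian gradient: project onto the tangent space
    W, s, Vt = np.linalg.svd(D, full_matrices=False)      # compact SVD of the descent direction
    U_new = U @ Vt.T @ np.diag(np.cos(s * t)) @ Vt + W @ np.diag(np.sin(s * t)) @ Vt   # geodesic step
    w_new, *_ = np.linalg.lstsq(X @ U_new, y, rcond=None) # refit the prediction coefficients
    return U_new, w_new

rng = np.random.default_rng(5)
X = rng.normal(size=(80, 30)); X -= X.mean(axis=0)
y = X[:, 0] + 0.5 * X[:, 1] + 0.1 * rng.normal(size=80)
_, _, Vt0 = np.linalg.svd(X, full_matrices=False)
U = Vt0[:2].T                                             # PCA initialization of the basis
w, *_ = np.linalg.lstsq(X @ U, y, rcond=None)
for _ in range(50):
    U, w = sgd_spca_step(X, y, U, w, lam=0.1)
print("final objective:", objective(X, y, U, w, lam=0.1))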

Simulation Studies
The performance of the three methods (PCA, SPCA, and SGD-SPCA) was checked using three simulated datasets, generated with the same model developed by Bair et al. [1]. The simulations considered 500 variables (genes) and three different numbers of individuals (50, 100, 200), and the response was designed to have two classes. A comparison between PCA, SPCA, and SGD-SPCA is presented in Table 1. As can be seen, SGD-SPCA performs better than SPCA and PCA. For instance, in the 200-sample dataset, the first principal component (PC1) captures 62.56% of the variation using SPCA, whereas it captures only 48.51% of the variation using PCA. The sensitivity (also called the true positive rate) and the specificity (also called the true negative rate) were also checked. The best possible ROC curve is obtained when the area under the curve (AUC) is equal to 1. PCA, SPCA, and SGD-SPCA were applied to the simulated dataset with 200 samples. Figure-1 shows that SGD-SPCA comes closest to this best scenario, with an AUC of 82.9%, indicating that SGD-SPCA provides the best classification.
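As an illustrative sketch (assuming scikit-learn and a simple synthetic two-class dataset, not the paper's simulation model), an AUC comparison of this kind can be obtained for any projection method by fitting a classifier on the projected components and scoring it on held-out data:

import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(6)
n, p = 200, 500
y = rng.integers(0, 2, size=n)
X = rng.normal(size=(n, p))
X[:, :20] += 1.5 * y[:, None]                 # only the first 20 genes separate the classes

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
proj = PCA(n_components=4).fit(X_tr)          # replace with an SPCA / SGD-SPCA projection to compare
clf = LogisticRegression(max_iter=1000).fit(proj.transform(X_tr), y_tr)
auc = roc_auc_score(y_te, clf.predict_proba(proj.transform(X_te))[:, 1])
print("AUC:", round(auc, 3))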

Experimental results
SPCA and SGD-SPCA were applied to two real datasets, the Leukemia and SRBCT datasets, downloaded from the UCI Machine Learning Repository (https://archive.ics.uci.edu/ml/index.php). In the following two sections, a brief description of each dataset is given.

Leukemia dataset
The three types of blood cancer, namely leukemia, lymphoma, and myeloma, seriously damage the circulatory and lymphatic systems. They are classified into different types, which affect different types of white blood cells. Part of this work focuses on leukemia, in which patients generate high numbers of abnormal white blood cells that are not functional and hence cannot fight infections. Based on the impact on and growth of white blood cells, leukemia is divided into four types: Acute Lymphocytic Leukemia (ALL), Acute Myeloid Leukemia (AML), Chronic Lymphocytic Leukemia (CLL), and Chronic Myeloid Leukemia (CML) [21]. The gene expression dataset analyzed in this paper includes data from AML and ALL patients, published by Golub et al. [14]. The data are derived from a proof-of-concept study, and they show how gene expression monitoring (via a DNA microarray) can classify new cases of cancer, providing a common approach for assigning tumors to known classes and identifying new cancer classes. Using this dataset, patients were classified into the AML and ALL categories. The complete leukemia dataset has 7129 genes and 72 observations. The low number of observations does not allow much flexibility for supervised methods, given the need to split the dataset into training and testing parts. The raw dataset was processed following the preprocessing steps described in the original paper [22].

SRBCT dataset
The SRBCT dataset contains the gene expression of 83 observations (patients) with 2308 variables (genes). Correct clinical diagnosis is extremely challenging for the four different childhood tumors because of their similar appearance on routine histology. The tumor types are the Ewing family of tumors (EWS), rhabdomyosarcoma (RMS), neuroblastoma (NB), and Burkitt lymphoma (BL). In this paper, the distinction between these four tumors is made based on the gene expression values. The dataset was split into 63 training samples (23 EWS, 20 RMS, 12 NB, and 8 BL) and 20 testing samples (6 EWS, 5 RMS, 6 NB, and 3 BL).

Selection of Discriminatory Features
Working with large datasets presents many difficulties, such as high time consumption and inefficient results. To analyze the leukemia and SRBCT datasets, we therefore selected the genes most significant for the cancer type; in other words, the genes that are differentially expressed across classes. The HSIC procedure (Section 3.2) was used for the leukemia dataset, and it identified only a small subset of the genes as significant, as can be observed from the corresponding figure.
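A per-gene screening of this kind can be sketched as follows (illustrative only: a linear kernel on each gene, a delta kernel on the labels, and a hypothetical cutoff on the ranked HSIC scores):

import numpy as np

def hsic_score(x, label_kernel):
    """Biased HSIC between a single gene x (linear kernel) and a label kernel matrix."""
    n = x.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n
    K = np.outer(x, x)                               # linear kernel on the gene
    return np.trace(K @ H @ label_kernel @ H) / (n - 1) ** 2

rng = np.random.default_rng(7)
labels = np.repeat([0, 1], 36)                       # two synthetic classes
X = rng.normal(size=(72, 1000))
X[:, :30] += 1.0 * labels[:, None]                   # 30 genes carry class information
L = (labels[:, None] == labels[None, :]).astype(float)

scores = np.array([hsic_score(X[:, j], L) for j in range(X.shape[1])])
selected = np.argsort(scores)[::-1][:30]             # keep the top-ranked genes
print("informative genes recovered:", np.sum(selected < 30))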

Modified t-test
In this section, a modified t-test is used to select the most significant genes in the SRBCT dataset. The common t-test, proposed by Welch, is used to measure the difference between two groups of samples. Based on Eq. 14, the t-test calculates a score $t_j$ that represents gene $j$:

$t_j = \dfrac{\bar{x}_{j1} - \bar{x}_{j2}}{\sqrt{\, s_{j1}^{2}/n_1 + s_{j2}^{2}/n_2 \,}}$   (14)
Here, $\bar{x}_{j1}$ and $\bar{x}_{j2}$ are the mean expression values of gene $j$ in the two classes, $s_{j1}^{2}$ and $s_{j2}^{2}$ are the corresponding variances, and $n_1$ and $n_2$ denote the numbers of samples. There are two limitations to the usage of the t-test. First, the t-test handles problems with only two classes. Second, if the means of the two classes are equal, then $t_j = 0$ and the gene is removed as irrelevant, whereas it might still provide classification information for the samples. The t-test was modified to overcome these two problems, as follows:

$t_{jk} = \dfrac{\bar{x}_{jk} - \bar{x}_{j}}{s_j}$   (16)

where $\bar{x}_{j} = \frac{1}{n}\sum_{i} x_{ij}$, $\bar{x}_{jk} = \frac{1}{n_k}\sum_{i \in C_k} x_{ij}$, and $s_j = \sqrt{ \frac{1}{n-K} \sum_{k} \sum_{i \in C_k} \left(x_{ij} - \bar{x}_{jk}\right)^{2} }$. Here $K$ and $n$ refer to the number of classes and samples, respectively; class $k$, which contains $n_k$ samples, is denoted by $C_k$; $s_j$ is the pooled within-class standard deviation for gene $j$; $\bar{x}_{j}$ is the mean expression value of gene $j$; $\bar{x}_{jk}$ is the mean expression value of gene $j$ in class $k$; and $x_{ij}$ is the expression value of gene $j$ in sample $i$. Eq. 16 is used to calculate $t_{jk}$, the score of gene $j$ with respect to class $k$. The genes with high scores were selected for further processing because they are more relevant to the classification. In this way, 208 genes were determined to be essential genes in the SRBCT dataset. The most relevant genes in the SRBCT dataset are listed with their descriptions in Table-2. Figure-3 illustrates the correlation between the selected genes in the training data of the Leukemia and SRBCT datasets; as can be seen, some genes are highly correlated.
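A minimal sketch of this multi-class gene scoring (assuming observations in rows and integer class labels; taking the maximum per-class score as the final gene score is an illustrative choice, not necessarily the paper's):

import numpy as np

def modified_t_scores(X, y):
    """Per-gene score: largest |class mean - overall mean| / pooled within-class std, in the spirit of Eq. 16."""
    n, p = X.shape
    classes = np.unique(y)
    K = len(classes)
    overall = X.mean(axis=0)                                  # overall mean per gene
    within = np.zeros(p)
    per_class = []
    for c in classes:
        Xc = X[y == c]
        mu_c = Xc.mean(axis=0)                                # class mean per gene
        within += ((Xc - mu_c) ** 2).sum(axis=0)
        per_class.append(np.abs(mu_c - overall))
    s = np.sqrt(within / (n - K))                             # pooled within-class standard deviation
    return np.max(per_class, axis=0) / (s + 1e-12)            # max over classes, per gene

rng = np.random.default_rng(8)
y = np.repeat([0, 1, 2, 3], 20)                               # four tumor types, as in SRBCT
X = rng.normal(size=(80, 2000))
X[y == 0, :50] += 2.0                                         # the first 50 genes are class-informative

scores = modified_t_scores(X, y)
top = np.argsort(scores)[::-1][:50]
print("informative genes in the top 50:", np.sum(top < 50))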

Conclusions
The present work proposed a new SGD-SPCA method for reducing the dimensionality of large real cancer datasets. Stochastic gradient descent was used to modify the SPCA technique. The experimental results show accuracy values between 93 and 94 percent using four principal components for both the leukemia and SRBCT datasets. A comparison between the modified method and some other existing methods shows that SGD-SPCA achieves the best accuracy in the least time.