Breast Cancer Detection using Decision Tree and K-Nearest Neighbour Classifiers

Data mining plays an important role in healthcare by discovering hidden relationships in large datasets, especially in breast cancer diagnostics; breast cancer is among the leading causes of death in the world. In this paper, two algorithms, Decision Tree (DT) and K-Nearest Neighbour (KNN), are applied to diagnose the breast cancer grade in order to reduce its risk to patients. For the decision tree with feature selection, the Gini index gives an accuracy of 87.83%, while entropy gives an accuracy of 86.77%. In both cases, Age appeared as the most effective parameter, particularly when Age < 49.5, whereas Ki67 appeared as the second most effective parameter. Furthermore, the K value for K-Nearest Neighbour was selected based on the minimum error rate and the maximum test accuracy, yielding an accuracy of 86.24%; the distance metric was assigned using the Euclidean approach. From these models, it appears that breast cancer Grade 2 is the most prevalent type. As a future perspective, a comparative study could be performed between supervised and unsupervised data mining algorithms.


Introduction
Breast cancer (BC) occurs when cells grow and divide out of control. It can affect both males and females and is the most common cause of cancer death among women in the world. Early testing can increase protection from the disease and make treatment more effective. Despite efforts to develop treatments, advanced BC remains hard to treat, and the goals of therapy range from symptom palliation to extending survival [1][2]. Many factors increase the risk of BC, such as diet, age, menarche, menopause, heredity, and the number of children [3]. The diagnosis and prognosis of BC take a great deal of researchers' time. Machine learning (ML) and data mining (DM) play a great role in the detection and prediction of BC [4]: DM analyzes large amounts of data and can help in the early detection of BC [5]. In this research, two classification algorithms, Decision Tree (DT) and K-Nearest Neighbour (KNN), were applied to the data to provide efficient detection of the BC grade. In the first step, the data is loaded and then preprocessed by removing noise (cleaning the data) and treating missing data [6][7][8]. The most suitable method for treating missing data is imputation [9]. In this research, the most-frequent strategy is used instead of the mean strategy in order to preserve the reality of the data. After that, the data was divided into two sets, a training set and a test set; the DM algorithms were then applied and their results combined to obtain the final result.
The remaining parts of this research are organized as follows: Section 2 reviews the related work that has been done in this field. Section 3 explains the theoretical background of the ML techniques used in this research. Section 4 analyzes and preprocesses the data, dealing with noise and missing values. Section 5 covers the processing steps, which involve three phases: splitting the data, building a model, and evaluating the model.

Related Work
DM plays a great role in many areas like physiological data [10][11][12], gene/protein position dataset prediction, molecular bioactivity estimation for drug development [13][14], the colon cancer and leukemia dataset [15], etc. Machine learning techniques are applied in multiple medical fields to enhance medical decision-making. In bioinformatics, DM has important studies in the cancer area [16][15]; for example, the KNN technique has been used to detect BC [17]. Open-source BC data is available, such as the Surveillance Epidemiology and End Results (SEER) instances [18], the Wisconsin Breast Cancer (WBC) dataset [19], and the Wisconsin Breast Cancer Diagnosis (WBCD) dataset [20]. This research concerns a BC dataset obtained from the Medical City Center in Baghdad. Previous BC research has applied several DM techniques to different BC dataset resources, as illustrated next. In [19], the DT, KNN, and Naïve Bayes (NB) algorithms were used; the results show that DT gives the best result of the three, with an accuracy of 93.18%. In [17], the KNN algorithm was applied to the WBC dataset to predict BC with an accuracy of 94.35%. Furthermore, in [20], KNN and DT were applied to the WBCD dataset to diagnose whether BC is malignant or benign; the results show that the KNN classifier is more competent than the decision-tree classifier. Additionally, in [21], KNN and Support Vector Machine (SVM) were applied to the WBCD dataset to identify BC; both algorithms give good results: KNN = 92.31% and SVM = 95.65%.

Dataset Description
In this research, the work is based on a dataset of nine biomarkers, including Ki67, HER2, Age, and the type of surgery, taken from 940 patients suffering from breast cancer (BC). The study was designed at the Oncology Teaching Hospital, Medical City, Baghdad, Iraq, in 2014-2016; the data was recorded by hand and suffers from missing values, inconsistency, and noise (outliers). A description of these datasets is illustrated in Table 2. Staging is a way of describing how breast cancer has spread, involving the size of the tumor, whether it has reached the lymph nodes, whether it has reached distant parts of the body, and what its biomarkers are. Staging can be done either before or after a patient undergoes surgery.
As for diagnosis, breast cancer can be diagnosed through multiple tests, including a mammogram, ultrasound, MRI, and biopsy.
The grade describes the appearance of the cancer cells and tissue. There are three grades of breast cancer: Grade 1 means well-differentiated carcinoma, Grade 2 moderately differentiated carcinoma, and Grade 3 poorly differentiated carcinoma.
There are two basic types of surgery to remove breast cancer. In a lumpectomy, the surgeon removes the breast tumor and a small rim of normal tissue around it, while the rest of the breast remains intact. In a mastectomy, the surgeon removes the entire breast; in many, but not all, cases this includes the nipple and areola.
The classification also depends on the patient's age.

Theoretical Consideration
Sometimes, datasets may suffer from missing data, of which there are several types. Missing Completely at Random (MCAR) means that the missingness is unrelated both to the known values and to the missing values themselves. Missing at Random (MAR) means that the missingness depends on the known values but not on the missing values themselves. Not Missing at Random (NMAR) means that the missingness depends on the missing values themselves [9]. Treating missing values is performed in the preprocessing phase.
Before starting to build a model, the data can be used in a helpful way by partitioning it into two groups, called the training set and the test set. The training set is used to build models and attribute groups; it is useful for estimating parameters, comparing models, and all other actions needed to obtain a final model. The test set is used in the final step to evaluate model performance [22]. Basically, the data is divided either by K-fold cross-validation, which divides the data into K blocks called folds (suppose K = 5; Figure 1 illustrates the structure of K-fold cross-validation, which is not suitable for time-series data, small data, or unbalanced data), or by a train-test split, which divides the input data according to a ratio between the training set and the test set.

After the data is split and the model is built, several DM algorithms can be applied to the data in order to obtain results and detect the grade of the cancer. In this research, two algorithms, DT and KNN, are applied. DT is a non-parametric supervised algorithm that can be used for classification and regression and is often used for medical purposes. The structure of a DT consists of multiple nodes, one of which is named the root node. The nodes are connected by edges such that each node has exactly one incoming edge, except the root. Nodes with one or more outgoing edges are called internal nodes or test nodes; nodes without outgoing edges are called decision nodes, also known as leaf or terminal nodes [23][24]. In a decision tree, each internal node splits the instance space into two or more subspaces according to a certain discrete function of the input attribute values, as shown in Figure 2. Several criteria are used to decide which biomarker (attribute) best partitions the training set; this biomarker is then used as the test in a node of the tree [25]. One such metric is the Gini index (GI), used to reduce the probability of classification error (misclassification) [26]; it is computed in equation 1:

Gini(D) = 1 - sum_{i=1}^{c} p_i^2    (1)
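As an illustrative aside (not from the paper's implementation), the Gini index of equation 1 can be computed directly from a list of class labels:

```python
from collections import Counter

def gini_index(labels):
    """Gini(D) = 1 - sum over classes i of p_i^2 (equation 1)."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

# A pure node has Gini 0; a perfectly mixed two-class node has Gini 0.5.
print(gini_index([1, 1, 1, 1]))  # 0.0
print(gini_index([1, 1, 2, 2]))  # 0.5
```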
where Gini is the Gini index, c is the number of class labels, and p_i is the probability of class i. Entropy is a measurement of randomness or impurity in a system: its value is zero if all data belongs to one class, and it reaches its maximum value if the probability of each class is equal. Entropy is computed in equation 2:

Entropy(D) = - sum_{i=1}^{c} p_i log2(p_i)    (2)
where p_i is the probability of class i. Information Gain (IG) depends on entropy and measures whether an attribute is helpful for classification or not. If an attribute is useful for classification, it yields a higher IG value, and this attribute becomes a good choice for the splitting process; IG is the criterion used by ID3 [26][27][28]. IG is computed in equation 3 [29]:

IG(S, A) = Entropy(S) - sum_{v in Values(A)} (|S_v| / |S|) Entropy(S_v)    (3)
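Entropy (equation 2) and information gain (equation 3) can likewise be sketched in a few lines of Python; the function names here are illustrative, not the paper's code:

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy(D) = -sum(p_i * log2(p_i)) over classes i (equation 2)."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(labels, groups):
    """IG = Entropy(parent) - weighted child entropies (equation 3).
    `groups` holds the label subsets produced by splitting on an attribute."""
    n = len(labels)
    return entropy(labels) - sum(len(g) / n * entropy(g) for g in groups)

# Splitting a mixed node into two pure children recovers the full entropy:
print(information_gain([0, 0, 1, 1], [[0, 0], [1, 1]]))  # 1.0
```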
KNN is a non-parametric supervised algorithm. It is popular because it is easy to understand, but it wastes memory because it stores the whole training set for classification, and its time complexity is concentrated at test time: it saves all the training data and uses the entire training set for classification or prediction. It is therefore important to define a metric for calculating the distance between training and test samples [30][31]. The Euclidean distance, defined as the square root of the sum of the squared differences between the two points of interest, is the one favored by experts. The general (Minkowski) form of the distance is computed in equation 4 [32]:

d(x, y) = ( sum_{i=1}^{n} |x_i - y_i|^p )^(1/p)    (4)
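A minimal sketch of the distance of equation 4, with the Minkowski order p as a parameter (illustrative, not the paper's code):

```python
def minkowski_distance(x, y, p=2):
    """Equation 4: (sum |x_i - y_i|^p)^(1/p).
    p=1 gives the Manhattan distance, p=2 the Euclidean distance used here."""
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)

print(minkowski_distance([0, 0], [3, 4]))       # 5.0 (Euclidean)
print(minkowski_distance([0, 0], [3, 4], p=1))  # 7.0 (Manhattan)
```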
where p = 1 gives the Manhattan distance and p = 2 the Euclidean distance. After building a model, the performance of the system must be measured. This is done using a confusion matrix [33], as shown in Table 1, which summarizes system performance by recording True Positives (TP), where the classifier predicts positive and this is true; False Positives (FP), where the classifier predicts positive and this is false; True Negatives (TN), where the classifier predicts negative and this is true; and False Negatives (FN), where the classifier predicts negative and this is false. From these values, Precision, the ratio of instances correctly assigned as positive to all instances assigned as positive, is computed in equation 7; Recall, the ratio of instances correctly assigned to a class to all instances in that class, is computed in equation 8; the F1 score, the weighted average of Precision and Recall, is computed in equation 9; and Accuracy, the overall performance measurement of a classifier, is computed in equation 10 [34][6]:

Precision = TP / (TP + FP)    (7)
Recall = TP / (TP + FN)    (8)
F1 = 2 * Precision * Recall / (Precision + Recall)    (9)
Accuracy = (TP + TN) / (TP + TN + FP + FN)    (10)

For an imbalanced dataset, Recall and Precision are more useful than Accuracy.
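The four metrics of equations 7-10 follow directly from the confusion-matrix counts; a small sketch with purely illustrative counts (not the paper's results):

```python
def metrics(tp, fp, tn, fn):
    """Precision, Recall, F1, and Accuracy (equations 7-10)
    from a binary confusion matrix."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    return precision, recall, f1, accuracy

# Illustrative counts: 40 TP, 10 FP, 45 TN, 5 FN.
p, r, f1, acc = metrics(tp=40, fp=10, tn=45, fn=5)
print(round(p, 2), round(r, 2), round(acc, 2))  # 0.8 0.89 0.85
```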

Materials And Methods
This step is divided into two phases, preprocessing and processing, which treat the original dataset and apply DM algorithms to detect the BC grade, as shown below.

Preprocessing phase
This phase is the first step in treating the data [6]. Preprocessing involves several operations: data cleaning, data integration, data transformation, and data reduction, as shown in Figure 3; this process is often called data cleaning [9]. Missing values can be dealt with either by ignoring or deleting them, or by using an imputation technique. This paper uses imputation to fill missing values, which offers several strategies such as mean, median, and most frequent. In practice, the mean strategy produces outlier results, while the most-frequent strategy gives more suitable results and preserves the reality of the data, so this strategy has been applied in this research.
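A minimal sketch of this most-frequent imputation step, assuming scikit-learn is used; the column values below are synthetic stand-ins, not the clinical data:

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Two illustrative columns (e.g., Age and grade) with missing entries.
X = np.array([[49.0, 1.0],
              [np.nan, 2.0],
              [49.0, np.nan],
              [60.0, 2.0]])

# Replace each missing value with the most frequent value of its column,
# mirroring the most-frequent strategy chosen in this research.
imputer = SimpleImputer(strategy="most_frequent")
X_filled = imputer.fit_transform(X)
print(X_filled)  # NaNs replaced by each column's mode (49.0 and 2.0)
```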

Processing phase
This step is the second phase of DM, in which the DM algorithms are applied to the dataset in order to obtain effective results. This paper is concerned with two algorithms, DT and KNN. Before building a classifier, the data must first be split into a training set and a test set; in this paper, the training set is 80% of the input data and the test set is 20%.
A. Decision Tree (DT) Algorithm
The observed dataset has been processed using DT, with the result explored in Figure 4 and Table 3 and the confusion matrix shown in Figure 5. The classification report gives a more exact picture of classifier strength than accuracy alone, is easy to interpret, and helps discover problems; its Support column gives the number of samples belonging to each class. However, the resulting tree is huge, and its result is difficult to analyze. The solution to this problem is tree pruning with cost complexity, which sharpens the tree's features, decreases its size, and increases accuracy, as shown in Figure 6, which represents pruning with the Gini index. This gives Accuracy: 87.30%, Recall: 87.30%, Precision: 87.30%, with the classification report shown in Table 4 and the confusion matrix in Figure 7. If the DT is instead built with the entropy criterion, as shown in Figure 8, it gives Accuracy: 84.13%, Recall: 84.13%, Precision: 84.13%, with the confusion matrix shown in Figure 9 and the classification report in Table 5. Pruning can also be performed with the entropy criterion, as shown in Figure 10, giving Accuracy: 86.77%, Recall: 86.77%, Precision: 86.77%, with the confusion matrix in Figure 11 and the classification report in Table 6. The root node splits on the Age biomarker: all patients with values less than 49.5 go to the left of the node, and the rest are assigned to the right. The left and right child nodes are then also split, the left child depending on the Ki67 biomarker and the right child on another biomarker, and each patient is assigned to a subtree depending on the biomarker values; this process is repeated recursively and is called recursive partitioning.

B. K-Nearest Neighbour (KNN) Algorithm
KNN is applied to the same observed dataset. First, the number of neighbours (K) must be determined, taking a range from 1 to the square root of the number of samples, to find the most appropriate value, i.e., the point with minimum error and maximum accuracy, as shown in Figure 12 and Figure 13, which illustrate the relation between K and the error and between K and the accuracy, respectively.
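The tree construction and cost-complexity pruning described in subsection A can be sketched as follows; the data here is synthetic (the clinical dataset is not public), so the numbers are purely illustrative:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the 940-patient, 9-biomarker, 3-grade dataset.
X, y = make_classification(n_samples=940, n_features=9, n_informative=5,
                           n_classes=3, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

# Fit an unpruned Gini tree, then pick the ccp_alpha from the
# cost-complexity pruning path that maximizes test accuracy.
full = DecisionTreeClassifier(criterion="gini", random_state=0).fit(X_tr, y_tr)
path = full.cost_complexity_pruning_path(X_tr, y_tr)
best_alpha = max(
    path.ccp_alphas,
    key=lambda a: DecisionTreeClassifier(criterion="gini", ccp_alpha=a,
                                         random_state=0)
        .fit(X_tr, y_tr).score(X_te, y_te))
pruned = DecisionTreeClassifier(criterion="gini", ccp_alpha=best_alpha,
                                random_state=0).fit(X_tr, y_tr)
print(pruned.get_n_leaves(), "<=", full.get_n_leaves())
```

Swapping `criterion="entropy"` reproduces the entropy variant discussed above.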
Figures 12 and 13 show that the error decreases and the accuracy increases as K increases, and that K = 13 is the optimal point. With this value, the KNN classifier gives Accuracy = 86.24%, Recall = 86.24%, and Precision = 86.24%; the confusion matrix is shown in Figure 14 and the classification report in Table 7.
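The K-selection procedure (trying K from 1 to the square root of the sample count with the Euclidean metric, and keeping the K with maximum accuracy) can be sketched as follows, again on synthetic stand-in data:

```python
import math

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data; the real dataset is not public.
X, y = make_classification(n_samples=940, n_features=9, n_informative=5,
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

# Score every K from 1 to sqrt(n) and keep the most accurate one.
k_max = int(math.sqrt(len(X_tr)))
scores = {k: KNeighborsClassifier(n_neighbors=k, metric="euclidean")
             .fit(X_tr, y_tr).score(X_te, y_te)
          for k in range(1, k_max + 1)}
best_k = max(scores, key=scores.get)
print("best K:", best_k, "accuracy:", round(scores[best_k], 3))
```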

The Results Discussion
This research aims to analyze the performance of classification algorithms for BC based on the observed data described in Table 2, analyzed by KNN, DT pruned with the Gini index, and DT pruned with entropy, and compared on the accuracy, Recall, and Precision values shown in Table 8. Pruning with the Gini index gives better results than pruning with entropy, so it is relied upon in the classification process; a comparison between pruning with the Gini index and pruning with entropy is carried out in this research. In the DT, the Age biomarker is the most effective parameter, and the splitting process depends on its value: if Age < 49.5, the left child is checked, which depends on the Ki67 biomarker; otherwise, the right child, which depends on another biomarker, as shown in Figure 6 and Figure 10. In KNN, the Euclidean approach for measuring the distance between training and test objects gives better results than the other approaches, and the value of K is selected at the point with the highest accuracy and lowest error, as shown in Figure 12 and Figure 13.
This study implements the DT and KNN techniques on a real Iraqi dataset, which produces results of lower accuracy than the previous research [17,19,20,21] that examined different datasets, as presented in Section 2. Furthermore, it confirms that DT gives better results than KNN, as in [19]. Nevertheless, the techniques used are suitable for diagnosing BC, as in [17,19,20,21]. The results of the DT have some limitations regarding its huge size, which confuses the observer when trying to analyze it; thus, pruning with cost complexity has been used to overcome this problem. Moreover, this study confirms that Grade 2 is the dominant disease level, based on the results of the classification reports in Table 3 and Table 7.

Conclusions
In this study, we applied two different data mining (DM) classification techniques for breast cancer (BC) detection. The performance and accuracy of the techniques were compared to evaluate the most effective algorithm for classifying the dataset. The Python language was used as a supportive tool for analyzing the observed data, with helpful libraries. From the results, it was realized that pruning the Decision Tree (DT) with the Gini index (GI) gives the better results, with an accuracy, Recall, and Precision of 87.30% each; the classification process depends on the Age biomarker, where if Age < 49.5 the left child depends on the Ki67 value, else the right child depends on another biomarker. K-Nearest Neighbour (KNN) gives an accuracy, Recall, and Precision of 86.24% each, with the Euclidean distance giving the best result compared to the other approaches. It is concluded that both algorithms succeed in dealing with the accuracy, Recall, and Precision metrics, and that all three results are equal in each case. It is recommended to establish a collaboration bridge between BC medical centers and informatics scientists to obtain fruitful results. Finally, in the future, it is planned to increase the dataset sample size to gain more model efficiency, which may produce more accurate results.