CART-Based Approach for Discovering Emerging Patterns in Iraqi Biochemical Dataset

This paper applies data mining techniques to a real Iraqi biochemical dataset to discover hidden patterns in the relationships between tests. The preprocessing steps required considerable effort, because the raw dataset contains a very large share of null values (94.8%), which falls to 0% once these steps are complete. Because the dataset was unlabeled, several tests were assumed as classes so that the Classification And Regression Tree (CART) algorithm could be applied. This enabled the discovery of patterns in the relationships between tests, which in turn benefits patients' health: test values can be determined by performing only the relevant tests, which reduces the number of tests that patients must undergo.


Introduction
The Data Mining (DM) approach is considered an important and widespread field because it provides useful results in several areas. It is also easy to learn and offers many techniques that can be used in different ways. It is therefore regarded as a useful science from which humanity benefits every day through many discoveries. One of the fields that benefits most is medicine: the more biological data become available, the greater the interest in bioinformatics for analyzing the different types of emerging data. At first, bioinformatics was used to create and manage databases for storing biological information. Then, as more data became available, its task evolved to analyzing them in order to extract new hidden knowledge, including knowledge about chemical tests, protein domains, and protein structures [1].

Related work
DM techniques have been applied to many subjects. In [2], information on diabetic patients from the Ulster Community and Hospitals Trust (UCHT) for the years 2000 to 2004 was used to predict how well the patients' condition was controlled. The researchers used Feature Selection via Supervised Model Construction (FSSMC) to identify the important parameters in diabetic control, and then applied the classification techniques NB, IB1, and the C4.5 decision tree to the data. In [3], the dataset was collected from the Ministry of Health, Saudi Arabia. A support vector machine algorithm was applied to investigate which type of treatment is effective for each age category of diabetes patients, and the researchers used the Oracle Data Miner tool to analyze the data. In [4], the researchers used several algorithms (J48, basic logistic, and MLP) as a machine learning approach to analyze real data from several Iraqi breast cancer cases in early detection hospitals, using the Weka data mining tool. As a test option they employed 10-fold cross-validation, with a confusion matrix as the performance metric for identifying the best of the suggested algorithms. They also analyzed whether the error ratio decreases after several algorithm iterations; after 5-10 iterations it was lower for the MLP algorithm than for the basic logistic and J48 algorithms. In [5], machine learning algorithms were applied to a sample of 370 employees in Iraq, which was preprocessed so that the class attribute is based on the gender value. Two DM approaches were used to select attributes and reduce the feature space: the supervised attribute subset evaluator (CFS) with Greedy Stepwise as the search process, and the Gain Ratio Attribute Evaluator with Ranker as the search method. The Apriori and association rule algorithms were then used to classify the key factors driving job apathy.

Methodology
In this article, the relationships among real biochemical tests are analyzed to help discover how the tests affect each other. The work uses raw data that are analyzed here for the first time, and it proposes multiple preprocessing steps and DM algorithms for this type of dataset.

Data Description
The investigated dataset was obtained from a private Iraqi clinical biochemistry laboratory in Baghdad city. It was recorded as a handwritten hard copy and then converted to an electronic copy. The patients' cases are described by 71 parameters, which fall into two groups. The first group consists of 66 parameters that represent chemical tests. The second group consists of 5 parameters that represent personal information: the patient name, which already exists in the raw dataset, and the index, gender, age, and date, which were added during the preprocessing steps. The number of patients is 11000, of whom 5343 are female and 5657 are male; 10450 are adults and 550 are children. The dataset is noisy: it has mixed data types and a very high missing-value rate of 94.8%. It is also unlabeled and high-dimensional, with large variance in feature values. The details of the tests were gathered by interviewing a laboratory physician and from the recorded documents and laboratory guidelines. The data are described in Table 1.
Ch increases in diabetes, chronic pancreatitis, and hypothyroidism, and decreases in chronic anemia and malnutrition. Tri increases in liver disease and gout, and decreases in malnutrition. HDL is called "beneficial" cholesterol because it keeps the arteries open so that blood flows more smoothly. In other words, the higher the HDL level, the fewer the incidents of arteriosclerosis and angina. LDL, on the other hand, is called "bad" cholesterol, since an increased amount in the blood causes fatty deposits to accumulate in the arteries, which reduces blood flow and could cause a heart attack or stroke. The LDL value can be calculated using equation 1 [7].

$$\mathrm{LDL} = \mathrm{Ch} - \left(\mathrm{HDL} + \frac{\mathrm{Tri}}{5}\right) \tag{1}$$
In the case of renal disease, the Bu level in the blood increases. In contrast, the level decreases in cases of liver disease, because the liver is unable to form it [7]. Cr increases in some diseases, such as diabetes and high blood pressure. There are two types of bilirubin test, Direct and indirect, which are measured for an adult to help the doctor determine the treatment for liver disease. The amount of iron present in the blood varies during the day, which leads to requests for other tests, including Fe and TIBC. A high TIBC level indicates iron deficiency anemia resulting from bleeding, while a cause of a decline in TIBC is cancer of the digestive system. Further, the coagulation profile contains PT, which represents the clotting time; this test is calculated, and the doctor relies on its value in addition to the value of the INR test. The figure below shows a sample of the raw dataset for 27 patients, illustrating the missing values and high dimensionality:

The Preprocessing Methods
To choose the preprocessing operations and DM algorithms, it is first important to understand the needs of the dataset. The Python programming language was used to add the index feature: a unique number was added to each record with the Pandas library, so that the dataset could be handled through the index (idx) feature rather than through names. The age feature was added to the dataset based on certain tests, with the support of a clinical laboratory physician: it was assumed that the children's tests were (TSB, Ca, ALB, Bu, G6PD, RBS), while the remaining tests were for adults. The gender feature was added according to standard names in Iraq, with names common to both genders assumed to be female because females are the majority in Iraq. The date feature was added according to the date recorded in the registry hard copy. Next, the null values had to be removed. For the clinical laboratory dataset, the data were separated according to similarities in feature names (groups of patients with the same tests), which produced a number of smaller, separate dataset files without null values. The resulting datasets included many outliers, which were removed by applying thresholds of 50 for the sample size and 7 for the number of features. The separate datasets were then grouped according to the common features (biochemical tests) they share, which resulted in six groups. The first group has four common features (Ch, Tri, HDL, LDL) and 14 datasets. The second group has one common feature (HBA1C) and 2 datasets. The third group has one dataset with the features (Iron, TIBC). The fourth group has two common features (PT, INR) and 2 datasets. The fifth group has two common features (indirect, Direct) and 2 datasets. The sixth group has two common features (Bu, Cr) and 5 datasets.
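The separation into null-free sub-datasets can be sketched with Pandas as below. This is a minimal illustration, not the authors' code: the function name `split_by_test_pattern`, the toy column values, and the direction of the two thresholds (at least `min_rows` samples, at most `max_features` tests) are assumptions, since the text gives only the threshold values 50 and 7.

```python
import pandas as pd

def split_by_test_pattern(df, id_cols, min_rows=50, max_features=7):
    """Group patients by the exact set of tests they have and return one
    null-free sub-dataset per set (threshold directions are assumed)."""
    test_cols = [c for c in df.columns if c not in id_cols]
    present = df[test_cols].notna()
    # Label each row with the comma-joined names of its non-null tests.
    pattern = present.apply(lambda row: ",".join(row.index[row.values]), axis=1)
    subsets = {}
    for key, rows in df.groupby(pattern):
        tests = key.split(",") if key else []
        if not tests or len(rows) < min_rows or len(tests) > max_features:
            continue  # drop empty, too-small, or too-wide sub-datasets
        subsets[key] = rows[id_cols + tests]
    return subsets

# Toy data: two patients with (Ch, Tri) and two with (Bu) only.
demo = pd.DataFrame({
    "idx": [1, 2, 3, 4],
    "Ch": [180.0, 210.0, None, None],
    "Tri": [120.0, 95.0, None, None],
    "Bu": [None, None, 30.0, 41.0],
})
parts = split_by_test_pattern(demo, ["idx"], min_rows=2)
```

Each value in `parts` keeps only the identifier columns and the tests that every patient in that group actually has, so no null values remain.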
A discretization process was then applied to each common feature to convert its data type from numeric to nominal values, relying on the standard reference ranges of the tests shown in Table 2, so that these features could be assumed as classes.
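The discretization step can be illustrated with a short pure-Python sketch. The reference limits below are hypothetical placeholders (the real limits are those of Table 2), and the three labels "low", "normal", and "high" are an assumed labeling scheme.

```python
# Hypothetical reference limits (lower, upper); the real ones come from Table 2.
REFERENCE_LIMITS = {"Bu": (15.0, 45.0), "Cr": (0.6, 1.3)}

def to_nominal(test_name, value, limits=REFERENCE_LIMITS):
    """Map a numeric test value to a nominal class using its reference range."""
    low, high = limits[test_name]
    if value < low:
        return "low"
    if value > high:
        return "high"
    return "normal"  # both limits are treated as inside the normal range

bu_values = [10.0, 30.0, 45.0, 80.0]
bu_classes = [to_nominal("Bu", v) for v in bu_values]
# bu_classes == ["low", "normal", "normal", "high"]
```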

Proposed Feature Selection Methods
The features are a mixture of noise and effective features. A feature selection technique was therefore used to remove the noise and thereby improve accuracy, by finding the features with the highest impact on the class value while reducing training time and memory use. Recursive Feature Elimination (RFE) is a wrapper method that starts with all dataset features, builds a model, and discards the feature that is irrelevant according to that model. A new model is then built using the remaining features, and so on until a predetermined number of features is left or, when cross-validation is used, until high accuracy is reached [8]. The CART algorithm was used as the estimator for RFE, with the "gini" criterion and random state = 7. 10-fold cross-validation was used to evaluate the model; the resulting accuracy for each class is given in Table 3. RFE was applied to the datasets that have at least two features in addition to the class, so it was applied to three groups of datasets. The assumed class with the highest feature selection accuracy was then selected as the official class when applying the CART algorithm.
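A minimal sketch of RFE with a CART estimator, assuming scikit-learn is the library in use (the text does not name it). The synthetic data from `make_classification` and the choice of keeping 3 features are illustrative assumptions; only the "gini" criterion, the random state of 7, and the 10-fold evaluation come from the description above.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for one sub-dataset (assumption, not the real data).
X, y = make_classification(n_samples=200, n_features=6, n_informative=3,
                           n_redundant=0, random_state=7)

# CART as the RFE estimator, with the gini criterion and random_state=7.
selector = RFE(
    estimator=DecisionTreeClassifier(criterion="gini", random_state=7),
    n_features_to_select=3,  # assumed target number of features
)

# Putting selection inside the pipeline keeps each fold's selection
# restricted to that fold's training part.
model = Pipeline([
    ("rfe", selector),
    ("cart", DecisionTreeClassifier(criterion="gini", random_state=7)),
])
scores = cross_val_score(model, X, y, cv=10, scoring="accuracy")

# Fit once on all data to see which features survive elimination.
selector.fit(X, y)
selected = selector.support_  # boolean mask over the 6 features
```

The mean of `scores` plays the role of the 10-fold accuracy that is compared across the assumed classes.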

Classification and Regression Trees (CART) Algorithm
The Classification and Regression Trees (CART) approach constructs a binary tree in which each internal node denotes a condition on a feature, each of the two branches corresponds to an outcome of that condition (true or false), and each leaf node denotes a class label. At each node, the algorithm chooses the "best" feature for separating the data into individual classes according to a criterion such as "gini" [9]. The "gini" gain can be obtained by measuring the "gini" index for all values of a feature in the dataset. For a dataset T, the "gini" index is determined as in equation 2 [10].

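$$\mathrm{gini}(T) = 1 - \sum_{j=1}^{n} p_j^{2} \tag{2}$$
where $n$ is the number of classes and $p_j$ is the probability of class $j$ among the samples of the dataset. The gini split info, which measures the gini index over all values of a feature, is determined according to equation 3:

$$\mathrm{gini}_{split}(T) = \sum_{v} \frac{|T_v|}{|T|}\,\mathrm{gini}(T_v) \tag{3}$$
where $v$ represents a feature value and $T_v$ is the subset of $T$ whose samples take that value. The gain is obtained as the difference $\mathrm{gini}(T) - \mathrm{gini}_{split}(T)$, and is also called the gini information gain (gini-gain).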
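As a worked illustration of the gini index and the gini split info, the following pure-Python sketch computes both for a toy sample. The function names and the toy labels are illustrative assumptions, and the gain is taken here as the usual difference between the gini index of the whole dataset and the weighted gini index after the split.

```python
from collections import Counter

def gini(labels):
    """Gini index: 1 minus the sum of squared class probabilities."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def gini_split(feature_values, labels):
    """Size-weighted gini index of the subsets induced by each feature value."""
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    n = len(labels)
    return sum(len(g) / n * gini(g) for g in groups.values())

def gini_gain(feature_values, labels):
    """Reduction in gini index obtained by splitting on the feature."""
    return gini(labels) - gini_split(feature_values, labels)

# Toy sample: the feature separates the two classes perfectly.
labels = ["high", "high", "normal", "normal"]
feature = ["a", "a", "b", "b"]
parent = gini(labels)              # 1 - (0.5**2 + 0.5**2) = 0.5
after = gini_split(feature, labels)  # both subsets are pure, so 0.0
gain = gini_gain(feature, labels)  # 0.5 - 0.0 = 0.5
```

A feature that separates the classes perfectly recovers the full parent gini index as gain, while a feature unrelated to the class would give a gain close to zero.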

Model Implementation and Evaluation
The CART algorithm was implemented on the datasets of selected features, as well as on the datasets with one feature and a class. A hyperparameter tuning step, which is the search for a set of optimal hyperparameters, was then applied. Hyperparameters such as the splitting criterion, max-depth, and min-samples-leaf are set before the model is trained. Grid search, which is an exhaustive search within specified sets of hyperparameter values, was used here by defining the range of possible values for each hyperparameter. The sets were defined as max-depth-set = (2,3,4) and min_sample_leaf_set = (0.05,0.06,0.07,0.08,0.09,0.1), so that the hyperparameter space = {(2,0.05),(2,0.06),...,(4,0.1)}, and the hyperparameter values giving the highest cross-validation accuracy were chosen. Because of its time complexity, grid search is suitable for low-dimensional datasets [11]. 10-fold cross-validation was applied with the model, and the evaluation metrics for each class's samples (10-fold and testing accuracy, precision, recall, and f1-score) are given in Table 4. The bar plots of the relationships between features and classes are presented in Figures 1, 2, 3, 4, and 5, which are used to discuss five patterns among the resulting relationships.
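The grid search over the 18 hyperparameter combinations can be sketched as follows, assuming scikit-learn's `GridSearchCV` (the text does not name the tool). The synthetic data, the 80/20 train/test split, and the reuse of random state 7 are illustrative assumptions; the two value sets and the 10-fold cross-validation follow the description above.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for one selected-feature dataset (assumption).
X, y = make_classification(n_samples=300, n_features=4, n_informative=2,
                           n_redundant=0, random_state=7)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=7)

# The two value sets from the text; a float min_samples_leaf is read by
# scikit-learn as a fraction of the training samples.
param_grid = {
    "max_depth": [2, 3, 4],
    "min_samples_leaf": [0.05, 0.06, 0.07, 0.08, 0.09, 0.1],
}

search = GridSearchCV(
    DecisionTreeClassifier(criterion="gini", random_state=7),
    param_grid,
    cv=10,               # 10-fold cross-validation
    scoring="accuracy",
)
search.fit(X_train, y_train)

best_params = search.best_params_       # pair with the highest CV accuracy
cv_accuracy = search.best_score_        # mean 10-fold accuracy of that pair
test_accuracy = search.score(X_test, y_test)  # accuracy on held-out data
```

Every combination is trained ten times, once per fold, which is why the cost of grid search grows quickly as more hyperparameters or values are added.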

Results Discussion
The experimental results indicate that there is no relationship between the total cholesterol test and the triglyceride test, with an average over five results of 0.92 for 10-fold accuracy and 0.91 for testing accuracy (errors of 0.08 and 0.09). On the other hand, there is a positive relationship between the Blood Urea test and the Creatinine test, with 10-fold accuracy = 0.90 and testing accuracy = 0.90 (error = 0.1). There is also a positive relationship between the international normalized ratio test and the Prothrombin time test, with 10-fold accuracy = 1.00 and testing accuracy = 1.00 (error = 0), and a positive relationship between the Direct test and the indirect test, with 10-fold accuracy = 0.95 and testing accuracy = 0.98 (errors of 0.05 and 0.02). An inverse relationship was discovered between the Iron test and the total iron-binding capacity test, with 10-fold accuracy = 0.94 and testing accuracy = 1.00 (errors of 0.06 and 0). Furthermore, gender has an effect in determining the normal values from the standards of the biochemical tests. The age parameter was the same for all patterns, namely adult, since the resulting patterns were for adults only. Finally, the patterns were not associated with any specific date.

Conclusions
This paper presented experiments that can be applied to this type of dataset, with the purpose of discovering hidden relationship patterns among biochemical tests and of identifying which algorithms are helpful and which are not. The discovered patterns could be useful in diagnosis without the need for more tests, which would support Iraqi physicians in the decision-making process. Additionally, the proposed algorithms will help researchers handle this type of data, which had not been analyzed previously. The Classification and Regression Trees (CART) algorithm was found to be useful in the clinical field. Finally, the preprocessing phase proved to be a very important part of investigating this kind of dataset, owing to its high noise, null values, and highly complex raw data.