Image Retrieval Using Data Mining Technique

Although image retrieval has been one of the most important research areas over the last two decades, there is still room for improvement, since existing systems do not yet satisfy many users. Two major aspects that need improvement are the accuracy and the speed of the image retrieval system.


Introduction
In recent years, the development of visual and multimedia applications has led to the widespread use of digital images. In addition, the rise of image messaging applications and the use of photos on social media platforms mean that thousands of images are shared each second, so that millions of images accumulate each day. However, managing and organizing these digital images presents a problem.

Abdul-Samad and Kamal
Iraqi Journal of Science, 2020, Vol. 61, No. 8, pp: 2115-2125
Thus, the concept of indexing and retrieval was introduced to overcome this issue. Indexing relates to "how to store images in database to retrieve them (through querying) more efficient", whereas Retrieval relates to "how to retrieve images that are most relevant to the query image from images in database" [1].
Two retrieval methods are used to retrieve digital images from a database. The first, Text-Based Image Retrieval (TBIR), depends on metadata associated with each image and uses traditional query techniques to retrieve images by keyword. This method works well with small image databases but not with huge ones, because generating keywords for such databases is difficult and time consuming. The most important problem in TBIR is that different users use different words to describe the same image (human subjectivity), which adversely affects the efficiency of text-based image search [2]. Hence, a need has emerged for a more effective image retrieval system that performs indexing and retrieval automatically. The second method depends on image content for indexing and retrieval and is therefore generally known as Content-Based Image Retrieval (CBIR). CBIR, also known as Query By Image Content (QBIC) and Content-Based Visual Information Retrieval (CBVIR), was introduced in the 1990s. It depends on analysis of the visual content of the digital image, carried out by extracting image features such as color, texture and shape, which are called low-level features [3].
Data should provide knowledge and information for decision making. Data mining is the analysis of data and the process of finding interesting patterns in a large amount of data. The data may be stored in different sources such as data warehouses, the World Wide Web, and external data sources. The goals of data mining include fast retrieval of data or information, knowledge discovery from databases, identification of hidden and previously unexplored patterns, reduction of complexity, and savings in processing time, all of which are useful in CBIR. Data mining is sometimes called Knowledge Discovery from Database (KDD) [4].
Designing and implementing generic CBIR applications requires both advanced algorithms in the field of image understanding and advances in computer hardware. Therefore, most efforts are directed to specific CBIR applications [5]. CBIR applications range from personal use to medical diagnosis, crime prevention, education, military and many others [6].
This study proposes a CBIR system for both texture and non-texture images. Color and texture features with spatial information are used to analyze the visual content of the image. The proposed system uses a segmentation technique followed by feature extraction, and applies a clustering method to the images in the database. Segmentation and clustering are used to speed up the image retrieval process as well as to increase the accuracy of the system.

Materials and Methods
The next sections show how to design and implement the proposed system and describe in detail every material and algorithm needed for this work.

Image database
In this work, the INRIA Holidays database is used to evaluate the proposed system. This database contains personal holiday photos covering a very large variety of scenes such as pyramids, forests, sunsets, boats, etc. The dataset contains 1000 high-resolution JPEG images [7].

System Implementation Environment
The proposed image retrieval system is implemented in C# .NET 2015 under the Microsoft Visual Studio IDE (Integrated Development Environment), installed on Microsoft Windows 10 (64-bit). CSV (Comma-Separated Values) files are used to store the feature vectors and form the feature database. The hardware platform on which the proposed system runs is as follows:
- CPU (Intel Core i7, 2.7 GHz).
- Memory (8 GB RAM).
- Graphics card (VGA, 1 GB AMD Radeon).
- Storage (240 GB SSD).

The proposed image retrieval system
An overview of the proposed retrieval system architecture is presented in Figure-1. The system consists of two main stages: an offline stage and an online stage. In the offline stage, the features of the images are extracted and saved in the feature database.

Figure 1-Flow chart of the proposed system
The extracted feature vectors are then grouped into clusters, each containing vectors with similar values. In the online stage (retrieval stage), the features of the query image are extracted and compared with the cluster centroids in the master cluster. Then, the images in the cluster with the most similar values are retrieved.

HSV color space
Color is an important descriptor. Color information helps to extract details of the image, such as the object of interest. The HSV color space is widely used in computer graphics and is a more intuitive way of describing color. HSV stands for Hue, Saturation and Value. Hue represents a mixture of two primary colors, one of which is at full intensity. Saturation indicates the purity of the color, that is, how much the pure spectral color is diluted by mixing with white light. Value indicates the intensity of the color, also called brightness. In short, HS gives the chromatic information of the color, while V gives its intensity [8].

Three-dimensional HSV color histogram
The color histogram plays an important role in image analysis. The 3D HSV histogram is widely used in image retrieval because the HSV color space is close to human perception. In this paper, a region-based histogram is used: the image is divided into 9 equally sized segments to capture more spatial information, and a separate histogram is calculated for each segment. The color histogram is easy to compute and is one of the most effective descriptors for characterizing the distribution of colors in an image [9]. Increasing the number of histogram bins increases its discriminative power. However, a large number of bins increases the computational cost and is inappropriate for building efficient indices for image databases, whereas too few bins reduce discrimination. Thus, for better feature extraction, the number of bins must be selected properly. The number of bins can be chosen in proportion to the size of the dataset, with fewer bins for a small image database [10]. In this paper, 9 bins are used for Hue, 5 for Saturation, and 4 for Value. As mentioned before, HS gives the chromatic information of the color and V gives the intensity information; for that reason, more bins are assigned to H and S than to V. Thus, the total number of features extracted is 9×5×4=180 for each segment and 180×9=1620 for each image.
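The region-based histogram described above can be sketched as follows. This is a minimal illustration in Python rather than the C# used for the system; the function names, the 3×3 grid layout of the nine segments, and the uniform bin boundaries are assumptions, since the text only states the bin counts and the number of segments.

```python
import colorsys

H_BINS, S_BINS, V_BINS = 9, 5, 4   # bin counts stated in the text
GRID = 3                            # assumed layout: 3 x 3 = 9 equal segments


def hsv_bin(r, g, b):
    """Map one 8-bit RGB pixel to a flat index in the 9x5x4 HSV histogram."""
    h, s, v = colorsys.rgb_to_hsv(r / 255.0, g / 255.0, b / 255.0)
    # Each channel lies in [0, 1]; clamp so that 1.0 falls in the last bin.
    hi = min(int(h * H_BINS), H_BINS - 1)
    si = min(int(s * S_BINS), S_BINS - 1)
    vi = min(int(v * V_BINS), V_BINS - 1)
    return (hi * S_BINS + si) * V_BINS + vi


def region_hsv_histogram(image):
    """image: list of rows, each row a list of (r, g, b) tuples.

    Returns the 9 per-segment histograms concatenated into one list
    of 9 * 180 = 1620 pixel counts.
    """
    height, width = len(image), len(image[0])
    bins_per_segment = H_BINS * S_BINS * V_BINS
    features = [0] * (GRID * GRID * bins_per_segment)
    for y, row in enumerate(image):
        seg_row = min(y * GRID // height, GRID - 1)
        for x, (r, g, b) in enumerate(row):
            seg_col = min(x * GRID // width, GRID - 1)
            segment = seg_row * GRID + seg_col
            features[segment * bins_per_segment + hsv_bin(r, g, b)] += 1
    return features
```

The returned counts are raw; scaling them to a common range is a separate step.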

Color Correlogram
The lack of spatial information is the main drawback of the color histogram. For example, all the images shown in Figure-2 have the same color proportions but different spatial distributions. The correlation histogram (correlogram) addresses this drawback by taking the spatial correlation of the color distribution into account. It shows how the spatial correlation between pairs of colors changes with distance [11].
A color correlogram can be represented as a table indexed by color pairs (i, j), where the d-th entry specifies the probability of finding a pixel of color j at a distance d from a pixel of color i in the image. Let [D] denote the set of distances {d_1, …, d_D}. Then the color correlogram of the image I for the color pair (c_i, c_j) at a distance d can be written as [12]:

$$\gamma_{c_i,c_j}^{(d)}(I) = \Pr_{p_1 \in I_{c_i},\; p_2 \in I}\left[\, p_2 \in I_{c_j} \;\middle|\; |p_1 - p_2| = d \,\right] \qquad (1)$$

where p_1 and p_2 are pixels of the image and I_c denotes the set of pixels of color c.
The auto-correlogram captures the spatial correlation between identical colors only. The auto-correlogram of the image I for the color c_i at a distance d can be written as:

$$\alpha_{c_i}^{(d)}(I) = \gamma_{c_i,c_i}^{(d)}(I) \qquad (2)$$

Experiments show that both the correlogram and the auto-correlogram are computationally expensive. However, a correlogram with a small number of color and distance values still gives very good results without increasing the computational cost [12]. Thus, it is used in this work.
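As an illustration of the auto-correlogram, the sketch below estimates, for each quantized color and each distance, the fraction of neighbours at that distance that share the colour of the centre pixel. It assumes the image has already been quantized to integer color indices and that the distance between pixels is the chessboard distance max(|dx|, |dy|); both the function names and these choices are assumptions, not details taken from the paper.

```python
def ring_offsets(d):
    """All (dy, dx) offsets whose chessboard distance max(|dy|, |dx|) is d."""
    return [(dy, dx)
            for dy in range(-d, d + 1)
            for dx in range(-d, d + 1)
            if max(abs(dy), abs(dx)) == d]


def auto_correlogram(indexed, n_colors, distances):
    """indexed: 2-D list of quantized color indices in range(n_colors).

    Returns result[c][k]: estimated probability that a pixel at distance
    distances[k] from a pixel of color c also has color c.
    """
    height, width = len(indexed), len(indexed[0])
    result = [[0.0] * len(distances) for _ in range(n_colors)]
    for k, d in enumerate(distances):
        same = [0] * n_colors    # neighbours sharing the centre colour
        total = [0] * n_colors   # all in-image neighbours examined
        offsets = ring_offsets(d)
        for y in range(height):
            for x in range(width):
                c = indexed[y][x]
                for dy, dx in offsets:
                    ny, nx = y + dy, x + dx
                    if 0 <= ny < height and 0 <= nx < width:
                        total[c] += 1
                        if indexed[ny][nx] == c:
                            same[c] += 1
        for c in range(n_colors):
            if total[c]:
                result[c][k] = same[c] / total[c]
    return result
```

The output has n_colors × len(distances) values, whereas a full correlogram over all color pairs would have n_colors² × len(distances), which is why the auto-correlogram yields a much smaller feature vector.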

Row sum and Column sum
Row and column sums are image features that carry important information about the image; for two similar images, they are nearly the same. In this work, these features are calculated in the RGB color space. First, the image is resized to 256×256 so that every image has the same number of rows and columns. Then the red, green and blue channels are extracted so that each color component of each row and column is represented, and the sum of each row and each column of each channel is calculated. These features are another type of spatial feature that indicates the spatial relationship between pixels [13].
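A minimal sketch of the row sum and column sum features follows. It assumes the image has already been resized (the resizing step is not shown) and is given as rows of (r, g, b) tuples; the function name and the ordering of the output are illustrative assumptions.

```python
def row_column_sums(image):
    """image: list of rows, each row a list of (r, g, b) tuples,
    assumed to be already resized (e.g. to 256 x 256).

    Returns, for each channel in turn, the row sums followed by the
    column sums, concatenated into one list.
    """
    height, width = len(image), len(image[0])
    features = []
    for channel in range(3):  # 0 = red, 1 = green, 2 = blue
        rows = [sum(pixel[channel] for pixel in row) for row in image]
        cols = [sum(image[y][x][channel] for y in range(height))
                for x in range(width)]
        features.extend(rows)
        features.extend(cols)
    return features
```

For a 256×256 image this gives 3 × (256 + 256) = 1536 values.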

Texture Feature
The spatial arrangement of the pixels within an image is defined as the image texture. In this paper, the statistical gray-level co-occurrence matrix (GLCM) approach is used to extract texture features. In this approach, the gray-level spatial dependence of texture is explored. A co-occurrence matrix, $I_{d,\theta}(i, j)$, is a matrix in which the (i, j)-th element represents the frequency of occurrence of two pixels with gray levels i and j separated by a distance d in the direction θ. Variations of texture in a region can be captured through the co-occurrence matrix by varying θ and d. That is, the co-occurrence matrix characterizes the spatial interrelationships of the gray levels in a textured pattern, and it is invariant under monotonic gray-level transformations [14]. The extracted texture features are given in Table-1 [15].
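The co-occurrence matrix for one distance d and one direction θ can be sketched as below. The input is assumed to be a 2-D list of gray levels already quantized to range(levels), and the matrix is left non-symmetric and normalized to sum to 1. Contrast and energy are included only as two common GLCM statistics; whether they match the features of Table-1 is not confirmed here.

```python
def glcm(gray, levels, distance, angle_deg):
    """Normalized co-occurrence matrix for one (distance, angle) pair.

    gray: 2-D list of integer gray levels in range(levels).
    angle_deg: one of 0, 45, 90, 135.
    """
    # (row step, column step) for the four standard orientations.
    steps = {0: (0, 1), 45: (-1, 1), 90: (-1, 0), 135: (-1, -1)}
    dy, dx = steps[angle_deg]
    dy, dx = dy * distance, dx * distance
    height, width = len(gray), len(gray[0])
    counts = [[0] * levels for _ in range(levels)]
    total = 0
    for y in range(height):
        for x in range(width):
            ny, nx = y + dy, x + dx
            if 0 <= ny < height and 0 <= nx < width:
                counts[gray[y][x]][gray[ny][nx]] += 1
                total += 1
    if total == 0:
        return [[0.0] * levels for _ in range(levels)]
    return [[c / total for c in row] for row in counts]


def contrast(p):
    """Sum of (i - j)^2 * p(i, j): large for strong local variation."""
    return sum((i - j) ** 2 * v for i, row in enumerate(p)
               for j, v in enumerate(row))


def energy(p):
    """Sum of p(i, j)^2: large for uniform or regular textures."""
    return sum(v * v for row in p for v in row)
```

Calling `glcm` once for each combination of several angles and distances yields one matrix per combination, from which the statistics are then computed.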

Feature vectors clustering by fuzzy C-means
Clustering is the process of grouping data into classes or clusters. Objects within a cluster have highly similar features, while objects in different clusters have dissimilar features. In this work, the fuzzy c-means clustering algorithm is applied to the feature vectors stored in the feature database. After the features of each database image are extracted and saved in the feature database, the database is partitioned into a number of groups, each with similar features. Then, each cluster center is stored in a master cluster. At query time, the distance between the query image feature vector and each cluster center in the master cluster is calculated. The cluster whose center has the minimum distance is taken as the cluster whose images are similar to the query image.
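The clustering step can be illustrated with a compact version of the standard fuzzy c-means iteration. The fuzzifier m = 2, the random initialization, the stopping tolerance and the function name are assumptions; the paper does not state the parameters it used.

```python
import math
import random


def fuzzy_c_means(vectors, n_clusters, m=2.0, max_iter=100, tol=1e-5, seed=0):
    """Standard fuzzy c-means on a list of equal-length feature vectors.

    Returns (centers, u), where u[i][j] is the membership of vector i
    in cluster j and every row of u sums to 1.
    """
    rng = random.Random(seed)
    n, dim = len(vectors), len(vectors[0])
    # Random initial memberships, each row normalized to sum to 1.
    u = []
    for _ in range(n):
        row = [rng.random() + 1e-9 for _ in range(n_clusters)]
        s = sum(row)
        u.append([v / s for v in row])
    centers = []
    for _ in range(max_iter):
        # Center update: mean of the data weighted by membership ** m.
        centers = []
        for j in range(n_clusters):
            w = [u[i][j] ** m for i in range(n)]
            total = sum(w) or 1.0  # guard: an empty cluster gets a zero center
            centers.append([sum(w[i] * vectors[i][k] for i in range(n)) / total
                            for k in range(dim)])
        # Membership update from the distances to the new centers.
        change = 0.0
        for i in range(n):
            dist = [math.dist(vectors[i], c) for c in centers]
            if min(dist) == 0.0:
                # The vector coincides with one or more centers.
                zeros = [d == 0.0 for d in dist]
                new_row = [1.0 / sum(zeros) if z else 0.0 for z in zeros]
            else:
                inv = [d ** (-2.0 / (m - 1.0)) for d in dist]
                s = sum(inv)
                new_row = [v / s for v in inv]
            change = max(change,
                         max(abs(a - b) for a, b in zip(new_row, u[i])))
            u[i] = new_row
        if change < tol:
            break
    return centers, u
```

In terms of the text, `centers` would play the role of the master cluster, and each image can be assigned to the cluster in which its membership is highest.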

The proposed system steps
Convert the image from RGB color space to HSV color space.
B. Feature Extraction
- Feature extraction is carried out using spatial features such as the region-based histogram, auto-correlogram, row sum and column sum, and texture features using GLCM.
- The extracted features for each image are registered with the corresponding feature vector in a CSV file.
- All the feature vectors are forwarded to the fuzzy c-means clustering algorithm.
- The centroid of each cluster is put into a Master-cluster.

Similarity Measure
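The similarity between the query image and the images in the database is measured using the Euclidean distance. The distance between the query image and a database image is calculated by the following formula:

$$D(Q, T) = \sqrt{\sum_{k=1}^{n} (Q_k - T_k)^2}$$

where Q and T are the feature vectors of the query image and the database image, respectively, and n is the number of features.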
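The online comparison can be sketched as a two-step lookup: find the closest centroid in the master cluster, then compare the query only with the images of that cluster. The data layout (`members`, `vectors`) and the ranking of images inside the selected cluster by distance are assumptions made for illustration.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def nearest_cluster(query, master_cluster):
    """Index of the centroid in master_cluster closest to the query."""
    return min(range(len(master_cluster)),
               key=lambda j: euclidean(query, master_cluster[j]))


def retrieve(query, master_cluster, members, vectors, top_k=5):
    """members[j]: ids of the images in cluster j (hypothetical layout).
    vectors[image_id]: stored feature vector of that image.

    Returns up to top_k image ids from the closest cluster, nearest first.
    """
    j = nearest_cluster(query, master_cluster)
    ranked = sorted(members[j], key=lambda i: euclidean(query, vectors[i]))
    return ranked[:top_k]
```

Because only one cluster is searched, the number of distance computations is the number of centroids plus the size of the selected cluster, rather than the size of the whole database.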

Experimental results
Sixteen clusters selected from the image database, representing 20% of the database, were used to test the system. Many experiments with different combinations of spatial features were conducted on these clusters to reach acceptable accuracy and speed. Based on these experiments, the strengths and weaknesses of the system are explained. The precision, recall and accuracy of each cluster were calculated. The next subsections show the results of the experiments.
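The three measures can be computed as in the sketch below, which uses the standard definitions based on true and false positives and negatives. How the paper counted relevant images per cluster is not stated, so the inputs (`retrieved`, `relevant`, `database_size`) are assumptions.

```python
def retrieval_metrics(retrieved, relevant, database_size):
    """Precision, recall and accuracy for one query or one cluster.

    retrieved: ids returned by the system.
    relevant: ids that should have been returned (ground truth).
    database_size: total number of images considered.
    """
    retrieved, relevant = set(retrieved), set(relevant)
    tp = len(retrieved & relevant)   # returned and relevant
    fp = len(retrieved - relevant)   # returned but not relevant
    fn = len(relevant - retrieved)   # relevant but missed
    tn = database_size - tp - fp - fn
    precision = tp / (tp + fp) if retrieved else 0.0
    recall = tp / (tp + fn) if relevant else 0.0
    accuracy = (tp + tn) / database_size
    return precision, recall, accuracy
```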

Experiment using Histogram and GLCM features combination
In this experiment, a combination of the three-dimensional HSV color histogram for five segments and texture features extracted using GLCM was tested. The results are shown in Table-2.
In the next experiment, the same combination as in the first experiment (three-dimensional HSV color histogram for five segments and GLCM texture features) was tested, with gray-level row sum and column sum added to the combination. The results of that experiment are shown in Table-3.

row sum and column sum, GLCM, and correlogram features combination
The current experiment tested the combination of features that were used in the previous experiment. In addition to the previously extracted features, a correlogram was added to the combination. The correlogram was computed for quantized HSV color space. The results of the current experiment are shown in Table-4.

Experiment using Histogram, row sum and column sum, GLCM, and auto-correlogram features combination
In this test, a combination of a gray-level histogram for nine segments, gray-level row and column sums, a gray-level auto-correlogram, and GLCM texture features was used. The gray-level color space was used in order to decrease the feature vector size. The results of the current experiment are shown in Table-5.
... feature extraction. Also, the row sum and column sum features were extracted, but in the RGB color space: row sums and column sums were calculated for each channel. The results of the current experiment are shown in Table-6.

row and column sum, GLCM, and auto-correlogram features combination
The current experiment is the one selected for implementing the proposed system. In this experiment, a combination of the three-dimensional HSV histogram for nine segments, row sum and column sum in the RGB color space, the HSV auto-correlogram, and GLCM texture features was extracted. The results of this experiment are better than those of the previous experiments and are shown in Table-7 below.

Result Analysis
In the previous section, spatial feature descriptors (3D HSV histogram, row sum and column sum, auto-correlogram, correlogram, and GLCM) were extracted with different color spaces and different quantization schemes. Six experiments with different combinations of the extracted features were conducted until the desired result was reached. First, the time needed to extract the features of one image in each of the six experiments is shown in Figure-. The change in time is clearest in experiments 4 and 6 and is attributable to the image segmentation method. In experiments 1, 2, 3 and 5, five segments were used to extract the histogram, whereas in experiments 4 and 6, nine fixed-size segments were used. Segmenting the image into five regions took seven seconds. Thus, using fixed-size segmentation reduced the feature extraction time.
The best retrieval result was obtained using the 3D HSV histogram from nine fixed segments, RGB row sum and column sum, the HSV auto-correlogram, and GLCM, while the worst performance was given by the first experiment, which used only the 3D HSV histogram and GLCM. The experiments show that spatial features give good results, since they provide information about the relationship between pixels, and that adding row sum and column sum increases the accuracy of the results. In the proposed system, twelve GLCMs were extracted to analyze the texture, using four orientations (0°, 45°, 90° and 135°) and three distances (1, 4 and 8) between pixels in each direction. These GLCMs describe the texture of the image as fully as possible, which also increases the efficiency of the system. Using the correlogram increased the size of the feature vector and therefore required more processing time. Thus, we experimented with the auto-correlogram, which gave similar results with a smaller feature vector.
Normalizing the data in the feature vector also increases the performance of the system. At first, the feature vectors were tested without normalization. It was found that the results were sometimes dominated by large values in the feature vector, such as the histogram values that represent the counts of specific colors in the image. Normalization limits the values to the range 0-1, so that no large values distort the computations.
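One common way to limit every feature to the range 0-1 is min-max scaling applied to each feature (column) across the whole feature database. The text does not say which normalization was used, so this choice and the function name are assumptions.

```python
def min_max_normalize(vectors):
    """Scale each feature (column) of the feature database to [0, 1].

    vectors: list of equal-length feature vectors.
    A feature that is constant across all vectors is mapped to 0.0.
    """
    dim = len(vectors[0])
    lows = [min(v[k] for v in vectors) for k in range(dim)]
    highs = [max(v[k] for v in vectors) for k in range(dim)]
    return [[(v[k] - lows[k]) / (highs[k] - lows[k])
             if highs[k] > lows[k] else 0.0
             for k in range(dim)]
            for v in vectors]
```

With this scaling, a histogram count in the thousands and a texture statistic below 1 contribute on the same scale to the Euclidean distance. A query vector would need to be scaled with the same per-feature minimum and maximum as the database.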

Comparing the results with other works
In this section, the results of the current proposed system are compared with previous related work. The comparison is illustrated in Table-8.

Conclusion
In this paper, a CBIR system was developed using spatial features and a clustering approach. The system was designed to search both texture and non-texture images. The 3D HSV histogram showed better results than the other histograms because of several properties. First, it uses the HSV color space, which is the closest color space to human perception. Second, it computes the histogram for the combination of the three channels H, S and V, which gives more information about color. In addition, the histogram was calculated for nine image segments, which gives spatial information about color. The GLCM captured the texture features based on the spatial relationship between the image pixels. The results showed that spatial features make the system's results more accurate and that color and texture features with spatial information are sufficient, with no need for additional classifiers. Using FCM clustering reduces the search space and speeds up the system. The experiments also found that the clustering algorithm increases the accuracy of the retrieval system. The returned images may be the exact match to the query image, since the searching space was limited in image groups having similar