Diagnostic of pathology on the vertebral column machine learning-Cluster K-nearest Neighbor (CKNN) part (I)

In this investigation, we developed a graphical user interface (GUI) application to diagnose pathology of the vertebral column based on the Cluster K-Nearest Neighbor (CKNN) classifier. The system is implemented and simulated in Anaconda, and its performance is tested on a real dataset that contains 6 features and two (02) classes. The abnormal and normal classes consist of 210 and 100 instances, respectively. The test performance is compared under various test sizes (10%~50%) to predict the class label as the number of nearest neighbors k changes from 1 to 19. The results show that the accuracy depends on both independent parameters, the test size and k, and that the training accuracy is better than the test accuracy: they lie in the ranges [82.5%~100%] and [70%~84%], respectively. When k varies from 1 to 4, a higher training accuracy, larger than 90%, is observed, while the test set shows a lower accuracy in the range [74%~82.5%]. Increasing the test size and/or k does not significantly affect the accuracy. When k is larger than 1, the training accuracy is approximately 0.925±0.05 and the test accuracy (except for k=6 and 17) is about 0.79±0.05. The prediction of the class status may be optimized by combining the dataset size with the k-neighbors parameter. The GUI can help medical doctors diagnose patients effectively, take rapid decisions, and predict results in a reduced time.


Introduction
The growth of machine learning and artificial intelligence has been improving many research projects in the medical field for more than a decade, with an approach based on the integration of pre-existing data to make a diagnosis, take a decision, and predict results in a reduced time.
The K-Nearest Neighbor (KNN) algorithm is one of the most used and successful machine learning methods [1,2]. It is one of the simplest machine learning algorithms and works well for small datasets. Classification is memory-based: a new point is assigned the category held by the majority of its k nearest neighbors. No explicit model is fitted; building the model consists only of storing the training dataset. To make a prediction for a new, unseen data point, the algorithm finds the closest data points in the training dataset and uses the classification of this neighborhood as the predicted value of the new query instance [3]. If a model is able to make accurate predictions on unseen data, it is able to generalize from the training set to the test set.
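The majority-vote rule described above can be illustrated with a minimal sketch. This is not the implementation used in this work: the function name `knn_predict`, the Euclidean distance, and the toy two-dimensional points are assumptions chosen only for illustration.

```python
import math
from collections import Counter


def knn_predict(train_X, train_y, query, k=3):
    """Predict the label of `query` by majority vote among its k nearest training points."""
    # "Training" is only storage: all distances are computed at prediction time.
    distances = [
        (math.dist(x, query), label) for x, label in zip(train_X, train_y)
    ]
    distances.sort(key=lambda pair: pair[0])
    nearest_labels = [label for _, label in distances[:k]]
    # Majority vote; when counts are equal, the label seen first
    # (i.e. the one belonging to the closer neighbor) is returned.
    return Counter(nearest_labels).most_common(1)[0][0]


# Toy, made-up 2-D points (not the vertebral column data).
train_X = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (5.0, 5.0), (5.2, 4.9)]
train_y = ["normal", "normal", "normal", "abnormal", "abnormal"]

prediction = knn_predict(train_X, train_y, query=(1.1, 1.0), k=3)
```

With k equal to the full training set size, the vote simply returns the majority class, which shows why very large k values make the model insensitive to the query.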
The vertebral column or spine is a resistant and flexible articular bone chain that is attached to the skull at its upper extremity and to the pelvis at its lower end. In addition to protecting the spinal cord, it supports statics (posture) and locomotion.
The spine comprises 33 vertebrae stacked vertically on top of each other, forming a movable column of 24 vertebrae and a fixed column of fused vertebrae (the sacrum and the coccyx). The vertebrae are connected by facet joints at the back of the spine, which allow movement between the bones of the spine. The spine is stabilized by ligaments, and adjacent vertebrae are separated by an intervertebral disc that serves as a shock absorber and is formed by a fibrous ring and a central nucleus pulposus. The spine is divided into 05 parts: 1) cervical, formed by 07 cervical vertebrae; 2) thoracic, formed by 12 thoracic vertebrae; 3) lumbar, formed by 05 lumbar vertebrae; 4) sacrum, formed by 05 fused vertebrae; and 5) coccyx, formed by 03 to 04 vertebrae.
Degenerative pathologies of the vertebral column represent a non-negligible part of the activity in neurosurgery and spine surgery. Lumbar pathologies in particular are a frequent reason for consultation and require the neurosurgeon to make rapid and effective decisions so that the patient can return to activity as quickly as possible. Sometimes the decision on the pathology is obvious, but in complex cases it is more difficult to make the right choice.
Vijayalakshmi et al. [4] proposed a pattern recognition system to identify the pathologies of disc hernia and spondylolisthesis using the KNN machine learning algorithm. The experimental results showed that the system achieved a success rate of 88.31%.
Handayani I. investigated the Vertebral Column dataset by applying the K-NN algorithm to the classification of disk hernia and spondylolisthesis. The author's results showed that the accuracy of the K-NN classifier was 83% and that the average time needed to carry out the classification process was 0.000212303 seconds [5].
The purpose of our work is to introduce technology and artificial intelligence methods to neurosurgery in order to reduce the neurosurgeon's thinking time, with the capability of automatically deciding whether a patient has a normal or an abnormal lumbar spine and of helping with the decision in difficult cases. Our work focuses on the application of artificial intelligence to pathologies of the spinal column encountered in neurosurgery, namely disc herniation and spondylolisthesis, according to biomechanical attributes. The data are organized in two classes, and the task consists in classifying patients as belonging to one of two categories, Normal or Abnormal, based on the features. The following convention is used for the class labels: the categories Disk Hernia and Spondylolisthesis are merged into a single category labelled 'abnormal'. The goal is to build a machine learning model that learns from the measurements of six (06) input variables whose features are known, so that we can predict the class, Normal or Abnormal, for a new set of 6 inputs. This paper is organized as follows: Section 2 presents the vertebral dataset and defines the different features. Section 3 presents the experimental results and discussion, and Section 4 concludes the paper.

Data set
The data used in this investigation come from a secondary source, the Vertebral Column Data Set [6], which is a classical dataset in machine learning and statistics. The authors updated this dataset by replacing one inaccurate input value (degree_spondylolisthesis = 418.5430821, row id = 116) with 41.85430821. A short description of the dataset is given in reference [6]; it has a total of 310 instances, eight (08) features and two classes. The abnormal class consists of 210 instances, while the normal class contains only 100 instances; all are used to carry out the experiment. The eight (08) features, denoted by (X1, X2, ..., X8), and two responses (or outcomes, denoted by y1 and y2 for the abnormal class and the normal class, respectively) are used to build our model, which makes this a supervised learning task. Each patient is represented in the dataset by six biomechanical attributes derived from the shape and orientation of the pelvis and lumbar spine, as indicated in the figures. In this problem, we want to predict one of two options for the pathology, the abnormal class or the normal class. Every set of 6 attributes in the dataset belongs to one of these two (02) classes, so this is an example of a two-class classification problem. The desired output for a single data point is its class status. For a particular data point, the class it belongs to, with its defined range, is called class 1 or class 2.
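The data preparation described in the text (correcting the single inaccurate value and merging Disk Hernia and Spondylolisthesis into one abnormal class) might look like the sketch below. The record layout, the `id` and `label` keys, the label strings, and the sample rows are assumptions made for illustration; only the field name, the row id, and the two numeric values are taken from the text.

```python
import math
from collections import Counter

# Values quoted in the text.
FIELD = "degree_spondylolisthesis"
ROW_ID = 116
RECORDED_VALUE = 418.5430821
CORRECTED_VALUE = 41.85430821


def correct_outlier(records):
    """Return a copy of `records` with the single inaccurate entry replaced."""
    cleaned = []
    for record in records:
        record = dict(record)  # copy so the input list is left untouched
        if record["id"] == ROW_ID and math.isclose(record[FIELD], RECORDED_VALUE):
            record[FIELD] = CORRECTED_VALUE
        cleaned.append(record)
    return cleaned


def merge_label(label):
    """Collapse the two pathological categories into 'abnormal' (assumed label strings)."""
    return "abnormal" if label in ("Disk Hernia", "Spondylolisthesis") else "normal"


# Made-up rows for illustration only (the other features are omitted).
records = [
    {"id": 115, FIELD: 30.2, "label": "Spondylolisthesis"},
    {"id": 116, FIELD: 418.5430821, "label": "Spondylolisthesis"},
    {"id": 200, FIELD: 1.5, "label": "Disk Hernia"},
    {"id": 250, FIELD: -2.1, "label": "Normal"},
]

cleaned = correct_outlier(records)
class_counts = Counter(merge_label(r["label"]) for r in cleaned)
```

On the full dataset, the same count would be expected to give 210 abnormal and 100 normal instances, as stated in the text.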
From this dataset of measurements, we want to build a machine learning model that can predict the pathology for a new set of patient measurements, which makes this a supervised learning task. Supervised learning algorithms are usually applied to data that contain label information (the class target name). The outcomes y1 and y2 are based on the input data list of 6 strings. Each class is defined by the minimum, the maximum and the range (maximum minus minimum) of the features, as indicated in Table 1 and displayed in Figures 5a and 5b. Both classes show almost the same minimum value for the features (X1, X2, X3, X4, and X6), while the maximum value of each feature is larger for the abnormal class (class 0) than for the normal class (class 1).
One way of deciding which performance measure is suitable for the task is to consider the confusion matrix. A confusion matrix is a table of contingencies; in the context of statistical modeling, it typically describes the predicted labels versus the actual labels. It is common to output a confusion matrix (particularly for multiclass problems with more classes) for a trained model, as it can yield valuable information about classification failures by failure type and class. Here the confusion matrix is a two-by-two array, where the rows correspond to the true classes and the columns correspond to the predicted classes. Table 2 illustrates this meaning. The information in the matrix can be summarized by computing the accuracy, which can be expressed as the number of correct predictions divided by the total number of predictions.
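As an illustration of the two-by-two confusion matrix and of the accuracy derived from it, the following sketch builds the matrix with rows as true classes and columns as predicted classes. The function names, the label order, and the example labels are assumptions for illustration and are not taken from Table 2.

```python
def confusion_matrix(y_true, y_pred, labels=("abnormal", "normal")):
    """Rows are true classes, columns are predicted classes."""
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for true, pred in zip(y_true, y_pred):
        matrix[index[true]][index[pred]] += 1
    return matrix


def accuracy(matrix):
    """Correct predictions (the diagonal) divided by all predictions."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total


# Made-up labels for illustration only.
y_true = ["abnormal", "abnormal", "abnormal", "normal", "normal", "abnormal"]
y_pred = ["abnormal", "abnormal", "normal", "normal", "abnormal", "abnormal"]

matrix = confusion_matrix(y_true, y_pred)
score = accuracy(matrix)
```

The off-diagonal entries separate the two kinds of error (an abnormal patient predicted as normal, and the reverse), which a single accuracy value does not distinguish.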

Graphic User Interface
To integrate the classifier module with the patient database, a Graphic User Interface (GUI) was developed by the authors (Figure 6). The user can enter 6 features (float values), the test size (float value) and the neighbor parameter k (integer) as inputs to the classification program. The minimum and maximum value of each feature, the test size (which is limited to 9 values with a step of 5%) and the k-neighbors (with a step of 1) are indicated on the GUI. The program assigns the input data to the respective class with an accuracy larger than 80%, and displays all the retrieved information of the patient after the button is clicked. If the input data of a new patient are within the range of the dataset, the status will appear as either abnormal or normal, as indicated in Figure 6.
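The input checks implied by this description can be sketched as follows. This is a hypothetical validation helper, not the GUI code itself: the function name, the assumption that the 9 allowed test sizes are 10%, 15%, ..., 50%, the limit of k to 1~19, and the placeholder feature ranges are all assumptions (the real per-feature minimum and maximum come from Table 1).

```python
# Assumed: 9 test sizes from 10% to 50% in steps of 5%.
ALLOWED_TEST_PERCENTS = tuple(range(10, 55, 5))
K_RANGE = range(1, 20)  # assumed: k from 1 to 19, step 1


def validate_inputs(features, test_size, k, feature_ranges):
    """Return a list of error messages; an empty list means the inputs are valid."""
    errors = []
    if len(features) != 6:
        errors.append("exactly 6 feature values are required")
    else:
        pairs = zip(features, feature_ranges)
        for i, (value, (low, high)) in enumerate(pairs, start=1):
            if not low <= value <= high:
                errors.append(f"X{i}={value} is outside [{low}, {high}]")
    # Compare in percent to avoid floating-point noise such as 0.15 * 100.
    if round(test_size * 100, 6) not in ALLOWED_TEST_PERCENTS:
        errors.append(f"test size {test_size} is not an allowed value")
    if not isinstance(k, int) or k not in K_RANGE:
        errors.append("k must be an integer between 1 and 19")
    return errors


# Placeholder ranges for illustration only.
feature_ranges = [(0.0, 100.0)] * 6

ok = validate_inputs([50.0, 20.0, 40.0, 30.0, 90.0, 10.0], 0.15, 5, feature_ranges)
bad = validate_inputs([150.0, 20.0, 40.0, 30.0, 90.0, 10.0], 0.12, 25, feature_ranges)
```

Returning all error messages at once, rather than stopping at the first, lets an interface report every out-of-range field in a single pass.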

Figure 6. GUI showing the class status (1 and 2) and the feature information.
After good results were achieved in testing, all the trained data for the selected dataset were saved for use in the classification process. These data can be loaded back into the program. For a given input, excluding the training and testing procedure, the classification takes about a few seconds. The time refers here to the time taken to assign the input data of 6 features (without including the processing time) and determine the class status output. The experimental results show a maximum accuracy larger than 80% for the test set for k = 2 and k = 4~19, and larger than 90% for the training set for k = 1~9 and k = 11. The class label extraction of the test data succeeds in 83% (k=5,6, test size=15%), 82% (k=2,9, test size=35%), and 84% (k=16,19, test size=10%) of cases, while the training set shows a higher accuracy: 100% for all the training sizes when k=1, larger than 91% when k=2, and larger than 90% when k=3 and 4.
For k in the range [2~10], the training accuracy varies from 86% to 92.5%, while the test set shows a lower accuracy, in the range [70%~84%]. Considering a single nearest neighbor (k=1), the prediction on the training set is perfect. When more neighbors are considered, the model becomes simpler and the training accuracy drops. The test set accuracy when using a single neighbor is lower than when using more neighbors, indicating that using the single nearest neighbor leads to a model that is too complex. On the other hand, when more than 10 neighbors are considered, the model becomes even simpler, but the performance does not get worse. The best performance is somewhere in the range [2~10].
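The experiment behind these observations, a sweep of k from 1 to 19 at a fixed test size with training and test accuracy recorded for each k, can be sketched as below. The data here are synthetic 6-feature points generated only for illustration, so the numbers it produces are not the results reported in this paper; the helper names, the random seed, and the 30% test size are assumptions. The sketch also records the ratio of training accuracy to test accuracy.

```python
import math
import random
from collections import Counter


def knn_predict(train, query, k):
    """Majority vote among the k training points closest to `query`."""
    nearest = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]


def accuracy(train, samples, k):
    """Fraction of `samples` whose predicted label matches the true label."""
    hits = sum(knn_predict(train, x, k) == y for x, y in samples)
    return hits / len(samples)


def make_toy_data(rng, n_abnormal=42, n_normal=20):
    """Synthetic 6-feature points; NOT the vertebral column data."""
    data = [([rng.gauss(1.0, 1.0) for _ in range(6)], "abnormal")
            for _ in range(n_abnormal)]
    data += [([rng.gauss(-1.0, 1.0) for _ in range(6)], "normal")
             for _ in range(n_normal)]
    return data


rng = random.Random(0)
data = make_toy_data(rng)
rng.shuffle(data)

test_size = 0.30
n_test = round(len(data) * test_size)
test, train = data[:n_test], data[n_test:]

results = {}
for k in range(1, 20):
    train_acc = accuracy(train, train, k)
    test_acc = accuracy(train, test, k)
    ratio = train_acc / test_acc if test_acc else float("inf")
    results[k] = (train_acc, test_acc, ratio)
```

For k=1 every training point is its own nearest neighbor, so the training accuracy is exactly 1.0, which matches the perfect training prediction noted above; repeating the loop for each test size would give the full grid of accuracies.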

Figure 15. Accuracy versus k-neighbors. Test size: 50%.
Figure 16. Test accuracy versus k-neighbors, for various test sizes (10%~50%).
Figure 17. Test accuracy versus k-neighbors, for various test sizes (50%~90%).
Figure 18. Ratio versus k-neighbors, for various test sizes (10%~50%).
The ratios of the training accuracy to the test accuracy versus k, under various test sizes [10%~50%], are indicated in Figure 18.

Figure 19. Minimum and maximum of the test accuracy, training accuracy, and ratio versus k.
Based on the simulation results summarized in Table 3 (represented in Figure 19) [4].

Conclusion
In this investigation, we built a statistical machine learning model based on supervised learning algorithms, applied to a dataset that contains two labelled classes. A GUI has been developed using the KNN classifier to improve the efficiency of diagnosing pathology of the vertebral column. The working system, which diagnoses and recognizes the pathology, was tested successfully on real data. The experimental results show a high accuracy, larger than 90% for the training set and larger than 80% for the test set. The class label extraction of the test data succeeds in 83% (k=5,6, test size=15%), 82% (k=2,9, test size=35%), and 84% (k=16,19, test size=10%) of cases, while the training set shows a higher accuracy for all the training sizes: 100% when k=1, larger than 91% when k=2, and larger than 90% when k=3 and 4. The model works well on the training set and, although less accurate there, does not perform badly on the test set; this might still be acceptable for a model that learns from the measurements of six (06) input variables whose features are known. The test size combined with the CKNN method can be used to control the accuracy rate. Thus, we can predict the pathology of the vertebral column for a new set of six (06) inputs with a higher accuracy. The application is fast, which can reduce heavy physician workloads and diagnostic time and support rapid and effective decisions.