Ensemble classifiers for detection of advanced persistent threats

The demand for technology in almost all walks of life is increasing, driven by paradigm shifts such as the industrial revolutions (currently 4.0), the Internet of Things/Internet of Everything (IoT/IoE), Internet 2.0, Artificial Intelligence (AI) and Bring Your Own Device (BYOD), to mention a few. These shifts bring increased inherent vulnerabilities and exposure to sophisticated, dynamic threats. Advanced Persistent Threats (APTs), among other malware, have received serious attention because their complexity causes defensive solutions to detect them poorly. Poor understanding of APT attack tactics, insufficient analysis of network traffic logs and poor classification are some of the problems identified as causes of this poor detection. Researchers use network traffic logs to analyze the network and track attacks as packets move across network nodes. This research studies attack modelling in order to understand APT attack tactics, and generates an attack dataset through simulation as well as a real dataset for normal operation. The experiment is simulated in a virtual environment, applying a dimensionality reduction technique to the network traffic log for improved log processing. To improve APT detection accuracy, which is hampered by the stealthiness of these attacks, an ensemble of classifiers (Support Vector Machine, Random Forest, Decision Tree) with majority voting is used for attack classification, giving a detection accuracy of 90.47%.


Introduction
There is a growing demand for technology application and development in almost all walks of life, leading to flexibility in the platforms (hardware, software) that run day-to-day operations. This has brought an increase in the use of mobile devices, cloud computing and company policies such as BYOD (Bring Your Own Device) to get work done either onsite or from remote locations (Rashid et al, 2014). These migrations and developments seem impressive, but they come with increased vulnerabilities and exposure to attacks. Moreover, the devices (routers, firewalls, etc.) that establish communication and access checks for this infrastructure are often not properly configured, are prone to vulnerabilities, or allow access on the basis of trust, thereby exposing assets to possible threats (Randy, 2017; Rashid et al, 2014). A typical example of how attackers exfiltrate data from their target stealthily is to use, among other means, Internet Control Message Protocol (ICMP) echo request, Alshamrani  codes such that systems find them unclear, unreadable and therefore are not able to determine what they are meant for in some cases while in other cases they present as genuine code but are actually concealing malicious codes (Cert-UK, 2014; Binde et al, 2011). This increases the chance that the malware propagates through the system and persists for long periods without being detected.

Internet Control Message Protocol (ICMP) Traffic
ICMP is part of the Internet Protocol suite, along with the Transmission Control Protocol (TCP), the User Datagram Protocol (UDP) and others. Unlike those protocols, ICMP is not meant for data exchange; rather, it is used to establish the status of end-to-end communication in a network (Daniel, 2018). It is not intended to move application data between devices or to serve applications that need such transport. A typical example of ICMP usage is the PING tool in network communication (Shick and Horneman, 2014). PING uses ICMP to confirm the connectivity of devices in a network: it checks whether the computer whose IP address is being pinged is online and can respond to a message.
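To make the structure of an echo request concrete, the following is a minimal sketch that builds one in memory, following the header layout of RFC 792 and the Internet checksum of RFC 1071. It sends nothing on the network. The function names, identifier, sequence number and payload are illustrative assumptions, not values from the experiment.

```python
import struct

def internet_checksum(data: bytes) -> int:
    """One's-complement sum over 16-bit big-endian words (RFC 1071)."""
    if len(data) % 2:
        data += b"\x00"  # pad odd-length input with a zero byte
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]
    # Fold any carries back into the low 16 bits.
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF

def build_echo_request(identifier: int, sequence: int, payload: bytes = b"") -> bytes:
    """Return the bytes of an ICMP echo request (type 8, code 0)."""
    icmp_echo_request = 8
    # The checksum is computed with the checksum field set to zero.
    header = struct.pack("!BBHHH", icmp_echo_request, 0, 0, identifier, sequence)
    checksum = internet_checksum(header + payload)
    header = struct.pack("!BBHHH", icmp_echo_request, 0, checksum, identifier, sequence)
    return header + payload

# Hypothetical values chosen only for illustration.
packet = build_echo_request(identifier=0x1234, sequence=1, payload=b"ping")
print(packet.hex())
```

The echo request carries an optional data field after the 8-byte header. A receiver echoes it back without interpreting it, which is one reason this field can be misused to carry arbitrary bytes.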

Related work
Ghafir et al (2018) proposed an anomaly-based machine learning system, Machine Learning Advanced Persistent Threat (MLAPT). It is a phased system comprising Threat detection (8 different methods for various APT attack stages), Alert correlation (which correlates alerts from the threat detection phase to determine the particular APT attack stage they belong to, thereby reducing false alerts) and Attack prediction (which uses the results of the alert correlation phase) to predict an APT attack early, before its cycle is completed. The Alert correlation phase has three steps: Alert filter, which filters alerts for redundancy and reduces noise; Alert clustering, which collates related alerts; and Correlation indexing, which helps to determine how close alerts are within their individual clusters. The attack prediction phase uses Decision Tree learning, Support Vector Machine, K-Nearest Neighbours and Ensemble Classification algorithms to train the prediction model so that the one with the best prediction accuracy is selected for use. The solution predicts attacks by applying machine learning to the already known record of the monitored network. Unlike its comparators, Brogi and Tong (2016) and Giura and Wang (2012), this work is an autonomous system because it has a phase that can generate its own detection events. The results showed a reduced false positive rate. Its shortcoming is poor coverage of the APT attack lifecycle; more detection modules are needed to address this.
On the use of an ensemble of classifiers to improve detection accuracy for novel attacks, Prusti and Jena (2015) supported the view that single classifiers are not sufficient. Their objective was to obtain an improved detection rate with a lower false positive rate at minimal cost. They used an ensemble-based predictive model to classify normal and attack classes, with Support Vector Machine, Decision Tree and Neural Network as the combination of base learners. They evaluated a dataset of 38465 instances using the AdaBoost, LogitBoost and Bagging ensemble methods with the majority voting combination rule, and AdaBoost emerged as the best with 97.44% accuracy. Another work on ensembles is that of Mkuzangwe and Nelwamondo (2017), who proposed the use of AdaBoost (with the weighted majority voting combination rule), a decision tree (decision stump) and the information gain concept. The idea was a performance bound: the average information gain associated with the features used in building the ensemble is obtained and used as a measure of the classification accuracy of their work. Their network intrusion detection system was evaluated on the NSL-KDD dataset, filtered for Neptune and normal connections, with the aim of classifying both types of connection. Sornsuwit and Jaiyen (2015) also used an ensemble method in their work on detecting User to Root (U2R) and Remote to Local or User (R2L) attacks. They aimed to remove redundant features to improve their dataset, decrease the false alarm rate and increase detection accuracy for the attacks in question. Naïve Bayes, Decision Tree, K-Nearest Neighbour, Support Vector Machine and Multilayer Perceptron were used as weak learners with the AdaBoost ensemble method. Their results showed Naïve Bayes and Multilayer Perceptron giving the best sensitivity and specificity respectively.

Support Vector Machine
Support Vector Machine (SVM) is a machine learning technique used for classification and regression, among other tasks. The technique relies on finding an optimal hyperplane in a high-dimensional space. The goal of SVM is to design a hyperplane that separates all training vectors into two classes; of all such hyperplanes, the best one leaves the maximum margin from both classes. The following constraints show how training data are classified in the binary case.
For all elements x that are members of class +1, the following constraint is satisfied: $w^T x + b \geq +1$. For all elements x that are members of class -1, the following constraint is satisfied: $w^T x + b \leq -1$. The aim of SVM is therefore to determine the optimal hyperplane, defined by $w^T x + b = 0$, which maximizes the margin between the two conditions above. Having found the optimal hyperplane, the decision function is defined as $f(x) = \mathrm{sign}(w^T x + b)$.
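As a small illustration of the hyperplane and decision function described here, the sketch below evaluates $w^T x + b$ for a hand-picked weight vector and bias. The values of w and b and the sample points are assumptions chosen for the example; they are not learned from the dataset used in this work. The margin width 2/||w|| is the standard distance between the two supporting hyperplanes $w^T x + b = \pm 1$.

```python
import math

def decision_value(w, x, b):
    """Compute w^T x + b for one sample."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def predict(w, x, b):
    """Decision function f(x) = sign(w^T x + b); 0 is mapped to +1 by convention."""
    return 1 if decision_value(w, x, b) >= 0 else -1

def margin_width(w):
    """Distance between the hyperplanes w^T x + b = +1 and w^T x + b = -1."""
    return 2.0 / math.sqrt(sum(wi * wi for wi in w))

# Hypothetical hyperplane x1 + x2 - 3 = 0 (not taken from the experiment).
w, b = [1.0, 1.0], -3.0

print(decision_value(w, [2.0, 2.0], b), predict(w, [2.0, 2.0], b))  # on the +1 side
print(decision_value(w, [1.0, 1.0], b), predict(w, [1.0, 1.0], b))  # on the -1 side
print(margin_width(w))
```

Training an SVM amounts to choosing w and b so that this margin is as large as possible while every training point stays on its own side of the margin.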

Random Forest
Random Forest is a classifier that comprises a collection of decision trees; the predicted class is determined by the votes received from the constituent trees that make up the forest. The model is thus built from a collection (forest) of trees in which the root node and the internal nodes represent the input variables. Because the data are represented in tree form, the model is easier to interpret. The aim of the technique is a model that can predict the class attribute or label (normal or attack for this work).

Decision Tree
In order to make decisions, a Decision Tree builds a tree structure whose leaves hold the possible outcomes, so that a decision is reached by following the most efficient path between the root and a leaf. Its learning algorithm is described as below:  Choose the attribute that has the highest information gain
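The attribute selection step can be sketched as follows, using the usual entropy-based definition of information gain. The attribute names (`protocol`, `size`) and the four toy records are hypothetical and serve only to show the calculation.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(rows, labels, attribute):
    """Entropy of the labels minus the weighted entropy after splitting on attribute."""
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(row[attribute], []).append(label)
    remainder = sum(len(g) / len(labels) * entropy(g) for g in groups.values())
    return entropy(labels) - remainder

def best_attribute(rows, labels, attributes):
    """Choose the attribute with the highest information gain."""
    return max(attributes, key=lambda a: information_gain(rows, labels, a))

# Hypothetical toy records, not taken from the dataset used in this work.
rows = [
    {"protocol": "icmp", "size": "large"},
    {"protocol": "icmp", "size": "small"},
    {"protocol": "tcp", "size": "large"},
    {"protocol": "tcp", "size": "small"},
]
labels = ["attack", "attack", "normal", "normal"]

for name in ("protocol", "size"):
    print(name, information_gain(rows, labels, name))
print(best_attribute(rows, labels, ["protocol", "size"]))
```

In this toy data `protocol` separates the two classes perfectly while `size` tells nothing about the class, so the tree would split on `protocol` first.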

Dimensionality Reduction
This technique has been widely used to preprocess network traffic data and produce suitable input for various machine learning applications. Because input data are often redundant, a smaller set of new variables can be found, each a combination of the input variables, that carries essentially the same information as the input; finding such a set is the dimensionality reduction process. Principal Component Analysis (PCA) is a statistical dimensionality reduction technique. Its main purpose is to find a new coordinate system in which the input data can be expressed with far fewer variables without significant error (Sorzano et al, 2014).
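The idea can be illustrated with a minimal sketch that reduces two perfectly correlated features to one. It centers the data, forms the covariance matrix and finds the first principal component by power iteration. This is a didactic stand-in for a full PCA routine, not the procedure used in the experiment, and the toy data are an assumption.

```python
import math

def center(data):
    """Subtract the per-feature mean from every row."""
    n, d = len(data), len(data[0])
    means = [sum(row[j] for row in data) / n for j in range(d)]
    return [[row[j] - means[j] for j in range(d)] for row in data]

def covariance(centered):
    """Sample covariance matrix of already centered data."""
    n, d = len(centered), len(centered[0])
    return [[sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(d)]
            for i in range(d)]

def first_component(cov, iterations=200):
    """Dominant eigenvector of cov by power iteration.

    Assumes cov is non-zero and the start vector is not orthogonal
    to the dominant eigenvector.
    """
    d = len(cov)
    v = [1.0] * d
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v

def project(centered, component):
    """Express each row as a single coordinate along the component."""
    return [sum(a * b for a, b in zip(row, component)) for row in centered]

# Hypothetical toy data: the second feature is exactly twice the first.
data = [[1.0, 2.0], [2.0, 4.0], [3.0, 6.0], [4.0, 8.0]]
centered = center(data)
component = first_component(covariance(centered))
scores = project(centered, component)
print(component)
print(scores)
```

Because the second feature is redundant, one coordinate along the component direction (proportional to (1, 2)) is enough to recover each point, which is the sense in which the data are expressed with fewer variables.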

Ensemble Classifiers
When a collection of selected classifiers is trained to solve a common problem and their outputs are combined or aggregated to improve accuracy, the process is referred to as an ensemble method (Aburomman and Reaz, 2017; Sornsuwit and Jaiyen, 2015). Provided that the classifier outputs are independent of each other and the classifiers make their errors independently, combining the outputs of several classifiers can yield a resultant classifier that is better than its constituents. The multiple learners will reach different decisions, which can be combined in several ways to arrive at a single decision. Two processes are involved. The first is to select an ensemble of classifiers that are relevant, sufficient for the task at hand and diverse. This entails generating different base learners, with different algorithms, for use in the ensemble. The learners may make different errors in the instance space, but by combining them a stronger learner can emerge. The second is to devise a strategy for combining the results or decisions of the individual classifiers in a way that reinforces accurate decisions and suppresses error-prone classifications (Aburomman and Reaz, 2017; Prusti and Jena, 2015). This technique of bringing a selection of classifiers together has been implemented successfully in intrusion detection systems to enhance their performance (Mkuzangwe and Nelwamondo, 2017; Sornsuwit and Jaiyen, 2015; Prusti and Jena, 2015). By using an ensemble, low bias and low variance relative to the individual learners can be achieved, and where the two differ (one low, one high), a balance can be struck between them.
By combination, the statistical, computational and representational issues arising from the training data and the hypothesis space can be reduced, lowering the risk of choosing the wrong hypothesis that comes with single classifiers. Ensemble methods have been applied for prediction in other domains such as credit card fraud detection, weather forecasting, aviation and medicine, to mention a few.
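The benefit of independent errors can be made concrete with a short calculation. Assuming, purely for illustration, that each of n classifiers errs independently with the same probability p, a majority vote is wrong only when more than half of them err at once, which follows a binomial distribution. The error rate of 10% below is a hypothetical figure, not a result from this work.

```python
from math import comb

def majority_error(n, p):
    """Probability that more than half of n independent classifiers err.

    Assumes an odd n and that every classifier has the same error rate p.
    """
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k)
               for k in range(n // 2 + 1, n + 1))

# Hypothetical error rate of 10% per classifier.
print(majority_error(1, 0.1))
print(majority_error(3, 0.1))
print(majority_error(5, 0.1))
```

With three such classifiers the majority errs with probability 3(0.1)^2(0.9) + (0.1)^3 = 0.028, well below the 0.1 of a single classifier. The gain shrinks when the classifiers' errors are correlated, which is why diversity among the base learners matters.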

Figure 3
Ensemble of Classifiers Architecture (Sornsuwit and Jaiyen, 2015)
Figure 3 shows the architecture of an ensemble of classifiers, in which a collection of weak learners X, which constitute the inputs, is combined to form a stronger one. In the diagram, the weak learners L1, L2, L3…Ln form the set of inputs X, and the output of the process is Y, which represents the stronger classifier; the combination is carried out by voting. Voting can be by majority, plurality, weighted or soft voting. Majority voting is the most commonly used method: each classifier votes for a class label, and the winning label is the one that receives more than half of all votes. If no class label gets more than half of the votes, the instance is rejected and no prediction is made. Plurality voting, on the other hand, adopts the class label with the highest vote count, so there is no rejection, as there is always a class label with the highest count. Weighted voting allows single classifiers that have shown differing levels of performance to be combined so that they reinforce one another, producing a stronger learner. Where single classifiers output class probabilities, soft voting is adopted: with all classifiers treated equally, their probabilities are averaged and the class with the best average is chosen (Prusti and Jena, 2015).
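The four voting rules can be sketched in a few lines. The function names, the labels `scan` and `probe`, and the weights and probabilities are hypothetical values chosen to show how the rules differ; they are not outputs of the classifiers in this work.

```python
from collections import Counter

def majority_vote(votes):
    """Return the label with more than half of the votes, or None (rejection)."""
    label, count = Counter(votes).most_common(1)[0]
    return label if count > len(votes) / 2 else None

def plurality_vote(votes):
    """Return the label with the highest vote count (never rejects)."""
    return Counter(votes).most_common(1)[0][0]

def weighted_vote(votes, weights):
    """Return the label whose voters carry the largest total weight."""
    totals = {}
    for label, weight in zip(votes, weights):
        totals[label] = totals.get(label, 0.0) + weight
    return max(totals, key=totals.get)

def soft_vote(probabilities):
    """Average the class probabilities and return the most probable label."""
    labels = probabilities[0].keys()
    averages = {label: sum(p[label] for p in probabilities) / len(probabilities)
                for label in labels}
    return max(averages, key=averages.get)

# Hypothetical votes from five classifiers: no label has more than half.
votes = ["attack", "attack", "normal", "scan", "probe"]
print(majority_vote(votes))   # rejected, no prediction
print(plurality_vote(votes))

# Hypothetical weights, e.g. derived from each classifier's past accuracy.
print(weighted_vote(["attack", "normal", "normal"], [0.6, 0.2, 0.15]))

# Hypothetical class probabilities from three classifiers.
probabilities = [
    {"attack": 0.9, "normal": 0.1},
    {"attack": 0.4, "normal": 0.6},
    {"attack": 0.45, "normal": 0.55},
]
print(soft_vote(probabilities))
```

In the last example two of the three classifiers lean towards `normal`, so a hard majority vote would choose `normal`, while soft voting chooses `attack` because the first classifier is far more confident.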

Results and discussion
This section presents the results obtained from the individual classifiers as well as the ensemble. The tool used for the experiment is Weka version 3.8. The Voting algorithm was used for the ensemble, and Support Vector Machine, Decision Tree and Random Forest were the individual classifiers applied. The voting algorithm uses a set of classifiers or models whose predictions are combined either by taking their mean or mode or by letting the models vote on what the result will be. Majority voting was used as the combination rule that determines how the decisions of the listed classifiers are combined to produce a result.

Support Vector Machine (SVM)
The following shows the results obtained from using the SVM classifier on the dataset. Table 1 shows the SVM result for 3148 instances, with 60% used for training and 40% for testing. It shows that 2843 instances were correctly classified (90.31%) while 305 instances were incorrectly classified (9.68%). A breakdown shows that 1871 attack instances were correctly classified as attack, while 304 attack instances were incorrectly classified as normal. Furthermore, 1 normal instance was incorrectly classified as an attack, while 972 normal instances were correctly classified as normal.
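The reported accuracy follows directly from the four confusion-matrix counts, and the same counts give further measures that accuracy alone hides. The sketch below uses the SVM counts stated above, treating attack as the positive class; the function and key names are illustrative assumptions.

```python
def confusion_metrics(tp, fn, fp, tn):
    """Derive standard measures from binary confusion-matrix counts."""
    total = tp + fn + fp + tn
    return {
        "accuracy": (tp + tn) / total,
        "detection_rate": tp / (tp + fn),       # recall on the attack class
        "false_positive_rate": fp / (fp + tn),  # normal traffic flagged as attack
        "precision": tp / (tp + fp),
    }

# SVM counts from the text: attack is treated as the positive class.
svm = confusion_metrics(tp=1871, fn=304, fp=1, tn=972)
for name, value in svm.items():
    print(f"{name}: {value:.4f}")
```

Almost all of the misclassified instances are attacks labelled as normal (304 of 305), so the detection rate on the attack class is lower than the overall accuracy, while the false positive rate is close to zero.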

Decision Tree (DT)
The following shows the results obtained from using the DT classifier on the dataset. Table 2 shows the DT result for a total of 3148 instances, with 60% used for training and 40% for testing. It shows that 2847 instances were correctly classified (90.43%) while 301 instances were incorrectly classified (9.56%). A breakdown shows that 1874 attack instances were correctly classified as attack, while 301 attack instances were incorrectly classified as normal. Furthermore, no normal instance was incorrectly classified as an attack, while 973 normal instances were correctly classified as normal. The confusion matrix for the result is shown here:

Random Forest
The following shows results obtained from using RF classifier on the dataset.

Ensemble of classifiers
The following shows the results obtained from using the ensemble classifier on the dataset. Table 4 shows the ensemble result for a total of 3148 instances, with 60% used for training and 40% for testing. It shows that 2848 instances were correctly classified (90.47%) while 300 instances were incorrectly classified (9.53%). A breakdown shows that 1875 attack instances were correctly classified as attack, while 300 attack instances were incorrectly classified as normal. Furthermore, no normal instance was incorrectly classified as an attack, while 973 normal instances were correctly classified as normal. The confusion matrix for the result is shown here:
Table 5 shows a comparison of the results obtained when PCA is applied before running the ensemble and when it is applied after. It shows a slight change in the accuracy obtained by applying PCA to the dataset before using it with the SVM, RF, DT and ensemble classifiers. From the results presented, the model recorded 90.47% accuracy in classifying attack and normal instances of the given network traffic. This result shows an improvement from the use of the ensemble classifier, with Random Forest taking the highest vote in the ensemble under majority voting as the combination rule. In addition, after applying dimensionality reduction using principal component analysis, which for this work was presented in Section 3.4, an improved result can be seen, as presented in Table 5. The main purpose of PCA is to find a new coordinate system in which the input data can be expressed with far fewer variables without significant error. From the confusion matrices in the results, class imbalance contributed to the model classifying a fraction of the attack instances as normal.

Conclusion
ICMP is a benign protocol for testing connectivity to nodes on the network and is therefore allowed to pass through firewalls. This protocol has been exploited for exfiltrating data, as shown in this work. Network traffic logs are large in volume and carry information that can be used for detecting attacks if properly analyzed and processed. Using PCA, the attributes of the log can be meaningfully reduced to produce better data for further machine learning processes. If attack and normal network traffic can be properly classified, detection accuracy improves, because the detection system can tell an attack from normal traffic. A single classifier can perform poorly in terms of classification accuracy, hence the need for an ensemble of classifiers to decide on the best predicted class from all the individual classifiers.
APTs can adopt many techniques to carry out their attacks without being noticed by available detection systems. This work used ICMP echo request as a case. In this regard, this work suggests the following for further research: