A novel approach to predict Water Quality Index using machine learning models: A review of the methods employed and future possibilities

The development of computer models for forecasting the water quality index (WQI) has been a leading research topic worldwide and has gained considerable recognition over the last two decades. Maintaining adequate water quality requires good water management, which is achieved through various procedures, many of which rely on computer-aided forecasting tools. In this paper, a decade of research on the water quality index in the field of artificial intelligence (AI) is reviewed, with the aim of presenting the most viable and suitable methods and models for future researchers in the field of water quality. The review covers models such as ANN, ANFIS, SVM, other regression and time-series models, and other soft computing models. The review shows that there is room for long-term forecasts. It also shows that no single AI model outperforms all the others; it is therefore necessary to evaluate the strength of each model or model combination for each region in order to identify the method that works best for a given country or region. The use of AI has grown significantly in recent decades; however, there is still ample room for researchers to delve into and improve the field of water quality index prediction.


Introduction
The concern for clean and drinkable water is essential for health, water resources, and environmental purposes [1]-[7]. The demand of billions of people and numerous industries for clean, safe, and adequate freshwater has encouraged practitioners and research communities to engage in the prediction and monitoring of water quality in order to address this universal concern [8]. However, statistics show that, as of 2015, 29% of the world population did not have access to safe drinking water (WHO/IWA, 2017). Groundwater is one of the main sources of water worldwide (65% of groundwater is used for drinking purposes, 20% for irrigation and livestock, and 15% for industry and mining), and its quality depends on the interactions within the aquifer between chemical constituents, soil, rock, and gases [9], [10].
The Water Quality Index (WQI) is a classification indicator that has been widely used for both ground and surface water to determine water quality, and it plays a significant role in water resource management [11]-[13]. The WQI is an established method for determining water quality in a simple way: it responds to changes in the basic characteristics of water, uses a relative weight for each parameter, and classifies waters according to quality category levels [14]-[17]. The application of the WQI was first presented by Horton (1965) and Brown et al. (1970), and it has been adopted and modified by numerous researchers (Cude, 2001; Pesce and Wunderlin, 2000; Said et al., 2004). All these applications derive from the general concept of the Water Quality Index proposed by the National Sanitation Foundation (NSF) of the United States, and the NSFWQI [18] is the world's most commonly adopted method.
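The weighting and classification idea described above can be illustrated with a minimal sketch. It assumes a weighted arithmetic aggregation of per-parameter sub-index scores on a 0-100 scale; the parameter names, weights, and category bands below are hypothetical values chosen for illustration and are not taken from any of the reviewed studies.

```python
# Illustrative category bands (lower bound, label); these are assumptions.
ILLUSTRATIVE_BANDS = [
    (90, "excellent"),
    (70, "good"),
    (50, "medium"),
    (25, "bad"),
    (0, "very bad"),
]


def weighted_wqi(sub_indices, weights):
    """Aggregate 0-100 sub-index scores using relative weights."""
    if sub_indices.keys() != weights.keys():
        raise ValueError("sub_indices and weights must cover the same parameters")
    total_weight = sum(weights.values())
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    weighted_sum = sum(weights[p] * sub_indices[p] for p in weights)
    return weighted_sum / total_weight


def classify(wqi, bands=ILLUSTRATIVE_BANDS):
    """Map a WQI score to the first band whose lower bound it reaches."""
    for lower_bound, label in bands:
        if wqi >= lower_bound:
            return label
    raise ValueError("WQI is below the lowest band")


# Hypothetical sub-index scores and relative weights for three parameters.
scores = {"DO": 80, "BOD": 60, "pH": 90}
relative_weights = {"DO": 5, "BOD": 3, "pH": 2}
example_wqi = weighted_wqi(scores, relative_weights)
example_class = classify(example_wqi)
```

In practice, each index variant defines its own sub-index curves, weights, and category limits, so these would need to be replaced with the values of the index actually in use.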
An Artificial Neural Network (ANN) is an AI-based approach that has proved effective in handling large datasets and complex nonlinear input-output relationships, and is also a flexible and powerful computational tool [19], [20]. Over the past years, AI in the form of deep learning and machine learning models has been increasingly applied to various environmental engineering problems, including river water quality modeling, and has proved very effective [21]-[23].
The main objective of this paper is to evaluate and review the application of artificial intelligence in determining the water quality index, in order to provide guidance to upcoming researchers. A decade of studies is reviewed to identify the best model to be used globally or in a particular country. The performance of the models, measured using performance criteria, is compared with that of other models (single or hybrid), and the outperforming model is suggested for the particular region, state, or country.

Artificial Neural Networks
An artificial neural network (ANN) is a model comprising interconnected mathematical nodes, or neurons, similar to biological neurons, that form a structure able to relate input and output parameters [24]. Usually, signals are fed through one neuron to form the net input of another neuron, and the output layer, whose size equals the number of desired outputs, is determined by the weights and biases associated with the links among the neurons [25] (see Fig. 1). Because it can handle a wide range of variables, the ANN is an exceptional prediction tool. According to Adamowski et al. (2002), the ANN allows available information to be used to predict future values of possibly noisy multivariate time series [26]-[31]. ANNs have proved to be effective models for predicting highly complicated relationships, processes, and phenomena. They have been successfully employed in a multitude of applications, including river flow forecasting and the water quality index (WQI) [32]. However, the use of ANNs to develop accurate models for predicting the WQI of rivers has yet to be thoroughly investigated, as evidenced by a literature search that led to only two such publications: Juahir et al., 2004 and Khuan et al., 2002 [33], [34]. [35] developed a receptor model and compared it with ANN and MLR in order to predict the WQI of a river in Malaysia affected by agricultural practices along the sea; the results suggest that APCS-ANN can be used by environmental monitoring agencies in Malaysia. [36] developed an ANN model to determine the WQI of the Euphrates River, which has six sampling stations, and compared it with an MLR model; the comparison showed that the ANN has a higher degree of accuracy than the MLR.
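The way signals pass through weighted links and biases in a three-layer network can be sketched as follows. This is a minimal forward pass only, assuming a sigmoid activation in the hidden layer and a linear output layer; the function names and all weight values are hypothetical, and the training step that would fit the weights to WQI data is not shown.

```python
import math


def sigmoid(x):
    """Logistic activation used here for the hidden neurons (an assumption)."""
    return 1.0 / (1.0 + math.exp(-x))


def forward(inputs, hidden_weights, hidden_biases, output_weights, output_biases):
    """Forward pass of a three-layer network: input -> hidden -> output.

    hidden_weights has one row of input weights per hidden neuron;
    output_weights has one row of hidden weights per output neuron.
    """
    hidden = [
        sigmoid(sum(w * x for w, x in zip(row, inputs)) + bias)
        for row, bias in zip(hidden_weights, hidden_biases)
    ]
    return [
        sum(w * h for w, h in zip(row, hidden)) + bias
        for row, bias in zip(output_weights, output_biases)
    ]


# Hypothetical example: three water quality inputs, two hidden neurons,
# and a single output neuron standing in for the predicted WQI.
example_output = forward(
    inputs=[0.7, 0.2, 0.5],
    hidden_weights=[[0.4, -0.6, 0.1], [0.3, 0.8, -0.5]],
    hidden_biases=[0.0, 0.1],
    output_weights=[[1.2, -0.7]],
    output_biases=[0.05],
)
```

The length of the returned list equals the number of output neurons, which matches the description of the output layer above.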

Adaptive Neuro-Fuzzy Inference System
The adaptive neuro-fuzzy inference system (ANFIS), formerly referred to as the Takagi-Sugeno-Kang system, was created by Jang in 1993 and is a widely used fuzzy inference system [37]. It comprises five components: input(s), a fuzzy system generator, a fuzzy inference system (FIS), an adaptive neural network, and an output [38]. The technique uses a hybrid learning or back-propagation algorithm for training and combines the features of neural networks with the capabilities of fuzzy logic [39].
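The inference step that such a system performs can be illustrated with a small sketch. It assumes a single input, Gaussian membership functions, and first-order (linear) rule consequents in the Takagi-Sugeno style; the rule parameters are hypothetical, and the learning procedure that would tune them is not shown.

```python
import math


def gaussian_mf(x, center, sigma):
    """Gaussian membership function giving a degree of membership in (0, 1]."""
    return math.exp(-0.5 * ((x - center) / sigma) ** 2)


def sugeno_predict(x, rules):
    """First-order Sugeno inference for one input.

    Each rule is (center, sigma, slope, intercept): the first two define the
    membership function, the last two define the linear consequent.
    """
    strengths = [gaussian_mf(x, center, sigma) for center, sigma, _, _ in rules]
    total = sum(strengths)
    if total == 0.0:
        raise ValueError("no rule fires for this input")
    consequents = [slope * x + intercept for _, _, slope, intercept in rules]
    # Output is the firing-strength-weighted average of the rule consequents.
    return sum(w * f for w, f in zip(strengths, consequents)) / total


# Hypothetical two-rule system: a "low" and a "high" region of the input.
example_rules = [
    (0.0, 2.0, 0.0, 40.0),
    (10.0, 2.0, 0.0, 80.0),
]
example_prediction = sugeno_predict(5.0, example_rules)
```

The choice of membership function shape and the number of rules are design decisions; clustering methods such as those named in the studies below are one way of choosing the rule centres.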
Three hybrid ANFIS models, namely ANFIS-FCM (Fuzzy C-Means data clustering), ANFIS-GP (Grid Partition), and ANFIS-SC (Subtractive Clustering), were employed to predict the WQI; ANFIS-SC was observed to be the best-performing model [40]. [41] developed data-driven ANFIS models for the WQI in India using two different techniques, fuzzy C-means and subtractive clustering based ANFIS. Based on the performance of the models, SC-ANFIS was found to give more accurate results than FCM-ANFIS. [42] developed MLR, ANN, and ANFIS models to predict the WQI of the river Yamuna in India. The results showed that the ANN and ANFIS outperformed the MLR model by 10% in the verification phase. Although the ANN performed slightly better than the ANFIS, both models outperformed the MLR model in estimating the WQI.

Support Vector Machines/Regression
Support vector machines (SVM) are supervised machine learning models that serve as a powerful regression tool and have found applications in numerous prediction problems in various fields. In SVM, statistical learning theory and the structural risk minimization principle are employed to map the initial training samples into a higher-dimensional feature space through nonlinear kernel functions, and the optimal solution is obtained by converting the nonlinear problem into a linear one [43], [44].
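The role of the kernel function can be shown with a brief sketch. It assumes a radial basis function (RBF) kernel and evaluates the prediction of an already trained support vector regressor as a kernel-weighted sum over its support vectors; the support vectors, coefficients, bias, and gamma value are hypothetical, and the optimization that would produce them is not shown.

```python
import math


def rbf_kernel(a, b, gamma):
    """RBF kernel: similarity of two points in the implicit feature space."""
    squared_distance = sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return math.exp(-gamma * squared_distance)


def svr_predict(x, support_vectors, dual_coefs, bias, gamma):
    """Prediction of a trained kernel regressor: sum_i c_i * K(sv_i, x) + bias."""
    return bias + sum(
        coef * rbf_kernel(sv, x, gamma)
        for sv, coef in zip(support_vectors, dual_coefs)
    )


# Hypothetical trained model with two support vectors in a 2-D input space.
example_prediction = svr_predict(
    x=[0.5, 0.5],
    support_vectors=[[0.0, 0.0], [1.0, 1.0]],
    dual_coefs=[1.5, -0.5],
    bias=60.0,
    gamma=0.8,
)
```

The prediction is linear in the kernel values even though it is nonlinear in the original inputs, which is the sense in which the nonlinear problem is treated as a linear one.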
The Support Vector Machine (SVM) model is also regarded as a very powerful machine learning technique for both linear and nonlinear regression problems and has been used for various scientific problems with high prediction accuracy [40]. [45] studied the optimization of the SVM model to identify the major parameters that significantly affect the WQI and found that nitrate is the major parameter for WQI prediction. In a study on the prediction of the WQI in constructed wetlands, SVM and two other AI methods were used, and the results showed that the SVM predicted the WQI with higher accuracy than the two other models [26]-[31], [46]-[54].

Linear and Statistical Methods
Multilinear regression (MLR) is a statistical model that determines the relationship between a dependent variable and at least one independent variable [55]-[57]. The main idea of the research by Fullerton Jr et al. (2016) was to determine the dynamics of water demand for the city of El Paso (Texas, USA) using several prediction techniques, including the Linear Transfer Function (LTF) of Box and Jenkins (1976) [14], [57]-[60]. [61] adopted ANN and MLR techniques to predict the WQI in the Shivganga River basin, India. Parameters such as pH, EC, TDS, TH, Ca, Mg, Na, K, Cl, HCO3, SO4, NO3, and PO4 were considered for computing the WQI. Based on the results obtained from the models, the ANN model would be more beneficial for predicting the water quality index in future. [62] developed an MLR model and compared it with the WQI of the Tigris River; the authors concluded that the developed MLR model can be used to predict and monitor the water quality of the Tigris with reasonable precision. Kogekar et al. (2021) developed two time-series models, Auto-Regressive Integrated Moving Average (ARIMA) and Seasonal ARIMA (SARIMA), to predict the water quality index of the river Ganga. Only two important water parameters, dissolved oxygen and biochemical oxygen demand, were considered for prediction and subsequently for forecasting the WQI. The results indicate that SARIMA predicts the water quality parameters as well as the Water Quality Index (WQI) more accurately than ARIMA.
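As an illustration of how an MLR model relates a dependent variable to several independent variables, the sketch below fits the coefficients by ordinary least squares, solving the normal equations with Gauss-Jordan elimination. The function names and the small data set are hypothetical; the two predictors merely stand in for water quality parameters and the response for a WQI value.

```python
def fit_mlr(X, y):
    """Return [intercept, b1, ..., bk] minimising the sum of squared errors."""
    rows = [[1.0] + [float(v) for v in sample] for sample in X]
    k = len(rows[0])
    # Augmented normal-equation matrix [X^T X | X^T y].
    A = [
        [sum(r[i] * r[j] for r in rows) for j in range(k)]
        + [sum(r[i] * t for r, t in zip(rows, y))]
        for i in range(k)
    ]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(A[r][col]))
        if abs(A[pivot][col]) < 1e-12:
            raise ValueError("predictors are collinear; no unique solution")
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(k):
            if r != col:
                factor = A[r][col] / A[col][col]
                A[r] = [a - factor * b for a, b in zip(A[r], A[col])]
    return [A[i][k] / A[i][i] for i in range(k)]


def predict_mlr(coefs, sample):
    """Evaluate the fitted linear model for one sample of predictors."""
    return coefs[0] + sum(b * v for b, v in zip(coefs[1:], sample))


# Hypothetical data generated from y = 1 + 2*x1 + 3*x2 with no noise.
X_example = [[0, 0], [1, 0], [0, 1], [1, 1], [2, 1]]
y_example = [1, 3, 4, 6, 8]
example_coefs = fit_mlr(X_example, y_example)
```

With noise-free data the fitted coefficients recover the generating values up to floating-point error; with real monitoring data they would be estimates, and strongly correlated parameters would make the normal equations ill-conditioned.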

Evaluation of performance
In the development of any model, goodness of fit, error of fit, and bias are crucial for evaluating the accuracy and precision of the computing approach. The metrics used for appraising model accuracy were the mean absolute deviation (MAD), mean square error (MSE), root mean square error (RMSE), mean absolute error (MAE), coefficient of determination (R²), and correlation coefficient (R). These metrics were chosen because they have been applied in numerous related studies as effective means of establishing the accuracy of a prediction model [26]-[31], [46]-[50], [52]-[54], [60], [63]-[67]. Some of the most important evaluation criteria used in this review are explained below.

RMSE (Root Mean Squared Error)
RMSE is the square root of the MSE. It has been used as a standard statistical metric to measure model performance in various fields. The formula for determining the RMSE is given below:

$$\mathrm{RMSE} = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(y_i - \hat{y}_i\right)^2}$$

where N is the number of observations, y_i is the i-th observed value, and \hat{y}_i is the corresponding predicted value.

R-SQUARED (Coefficient of determination)
R-squared is the coefficient that indicates how well the predicted values fit the original values. Its value typically ranges from 0 to 1 and can be interpreted as a percentage. The higher the value, the better the model.

MSE (Mean Squared Error)
MSE is the average of the squared differences between the original and predicted values over the data set; it is also a measure of the closeness of the estimator to the true value.
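The criteria described in this section can be computed as in the following sketch. It assumes the usual definitions, with R-squared taken as one minus the ratio of the residual to the total sum of squares; the observed and predicted values are hypothetical.

```python
import math


def mse(observed, predicted):
    """Mean squared error: average squared difference."""
    return sum((o - p) ** 2 for o, p in zip(observed, predicted)) / len(observed)


def rmse(observed, predicted):
    """Root mean squared error: square root of the MSE."""
    return math.sqrt(mse(observed, predicted))


def mae(observed, predicted):
    """Mean absolute error: average absolute difference."""
    return sum(abs(o - p) for o, p in zip(observed, predicted)) / len(observed)


def r_squared(observed, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_observed = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_observed) ** 2 for o in observed)
    if ss_tot == 0.0:
        raise ValueError("observed values are constant; R-squared is undefined")
    return 1.0 - ss_res / ss_tot


# Hypothetical observed and predicted WQI values.
observed_wqi = [2.0, 4.0, 6.0, 8.0]
predicted_wqi = [3.0, 3.0, 7.0, 7.0]
example_scores = {
    "MSE": mse(observed_wqi, predicted_wqi),
    "RMSE": rmse(observed_wqi, predicted_wqi),
    "MAE": mae(observed_wqi, predicted_wqi),
    "R2": r_squared(observed_wqi, predicted_wqi),
}
```

Under this definition, R-squared can fall below 0 when a model fits worse than simply predicting the mean of the observations, so values outside the 0 to 1 range indicate a poor model rather than a calculation error.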

Discussion
It is important to mention that the reviewed studies indicate that artificial intelligence has recently received a lot of attention in the field of water quality index prediction. However, several AI models and optimization techniques have not yet been explored, which leaves a gap for researchers to fill in order to contribute to the development of the water quality index. The review shows that the majority of researchers employed ANN models and that various types of ANN, such as BPNN and FFNN, were used, whereas other types, such as ENN and RBFEL, received little or no attention in this field. There have also been advancements in which researchers began developing hybrid models to increase prediction accuracy. In addition, ANFIS and SVM received considerable attention in water quality index simulation. In most cases, however, the studies using ANFIS did not consider the various membership functions (MFs), which influence the precision of the model. It is also important to mention that the classical approaches no longer produce the best results; hence researchers have started producing integrated methods or hybrid models, which tend to give a higher degree of accuracy than the classical models. Table 1 shows the decade review of water quality index modelling and forecasting.

Ref. | Country | Performance criteria | Summary
[32] | Malaysia | R², AAE, R, P, N | An ANN was applied to archival measurements of the water quality variables (WQVs) of a surface water body to construct a model capable of forecasting and calculating the WQI. The approach can be applied to any aquatic system worldwide; empirical data analysis techniques such as ANNs are therefore recommended for the analysis of long-term environmental monitoring records.
[33] | Malaysia | R², RMSE | A receptor model was developed and compared with ANN and MLR on a dataset from the Kuantan River, Malaysia, in order to predict water quality affected by agricultural practices along the area. The results showed that WQI prediction using the APCS-ANN model can be used by environmental monitoring agencies in Malaysia.
[69] | Pakistan | R² (others not recoverable) | ANN techniques were used to predict the water quality index in Pakistan. Several models were developed and compared, and the results showed that the MLP had a high degree of accuracy compared with its counterparts.
[70] | Mexico | NSE | A Water Quality Index for monitoring and controlling shrimp culture systems was developed using an analytical hierarchical process, which detects poor water quality. The results indicate that priority parameter assignment provides a more effective water quality assessment than traditional approaches.
[71] | Malaysia | R², RMSE, MAE | A support vector machine (SVM) and two types of artificial neural network (ANN), namely feed forward back propagation (FFBP) and radial basis function (RBF), were used to predict the water quality index (WQI) in a free surface constructed wetland. Based on the results, the SVM and FFBP can be successfully employed for predicting water quality in a free surface constructed wetland environment.
[36] | Iraq | R, RMSE, MAPE | An ANN was employed to predict water quality in the Euphrates River, and MLR was used to obtain a set of coefficients for a linear model. Six sampling stations along the river were chosen. Comparison of the results showed that the ANN has a higher degree of prediction accuracy than the MLR.
[72] | Malaysia | R², RMSE | An ANN technique was adopted to predict the water quality index. Two methods, WAMF and USHM, were examined and compared. The models were trained and tested on land use and water quality data. The USHM model performs somewhat better and is more generalizable than the WAMF WQI model.
[73] | Iran | MSE | Bayesian regularization and ensemble averaging models were developed to predict the WQI from the concentrations of 16 groundwater quality variables collected from 47 wells and springs in a city in Iran. Comparison of model performance shows that Bayesian regularization performed better.
[74] | Iran | R², RMSE | The DELPHI and CCME methods were applied to calculate the average WQI for every month of a year. The WQI was formulated by both techniques, and RMSE, R2, and ADD were used together to compare the performance of the CCME and DELPHI models. The DELPHI method was found to have higher predictive capability than the CCME method.
[75] | Malaysia | R², MAE | Gene expression programming (GEP) and ANNs were employed to predict the WQI in free surface constructed wetlands. Seventeen points were selected and monitored for over 14 months, and an extensive data set was collected for 11 water quality variables. The GEP predicted the WQI with higher accuracy than the ANN.
[76] | China | RMSE (others not recoverable) | A novel data-driven approach was proposed to forecast the water quality index of a station by fusing multiple sources of urban data. The stMTMV model was compared with ANN, the RC decay model, ARMA, the Kalman filter, LR, and LASSO. The experiments showed that stMTMV has the best predictive accuracy among the models.
[77] | Malaysia | R², MSE | An FFANN was developed to predict the WQI while excluding biological oxygen demand (BOD) and chemical oxygen demand (COD), as they cannot be measured in real time. The results showed that the FFANN is able to predict the WQI with BOD and COD excluded from the model input variables.
[78] | Malaysia | R² (others not recoverable) | Radial basis function neural network (RBFNN) and back propagation neural network models were applied to examine and mimic the relationship between the WQI and the other water quality variables in a tropical environment in Malaysia. The results are promising, with the RBFNN showing a high degree of accuracy.
[41] | India | R² (others not recoverable) | A data-driven adaptive neuro-fuzzy system for the water quality index was developed using eight different monitoring stations in India. Two techniques, fuzzy C-means and subtractive clustering based ANFIS, were adopted and their performance was compared. The SC-ANFIS method was found to give more accurate results than FCM-ANFIS.
[79] | Ethiopia | R² (others not recoverable) | An ANN was used to predict the WQI of the most polluted rivers in Ethiopia from 27 sampling sites. To minimize the time and effort of repeated WQI determination, a modelling approach based on ANN can be employed successfully for determining the WQI.
[23] | India | RMSE | The Cascade and the Feed neural networks were adopted to predict the WQI in India. The Cascade model showed its ability to predict the WQI when the five parameters defined by WHO are employed, which gives it a higher degree of accuracy than the Feed neural network.

Conclusion
This review has presented an overview of the water quality index forecasting literature published from 2011 to 2021, in an attempt to provide guidance mostly for researchers and practitioners seeking to adopt methods and models suitable for planning-related decisions that depend on future levels of water quality. It can also be used to address short-, medium-, and long-term planning and decisions by researchers seeking to improve the predictive models.
To date, no global model outperforms all others in water quality index forecasting; rather, each country has its own model that performs better than the others. Another point concerns the performance of hybrid models, which show better results than the classical models.
Although major advances in AI methods have been made recently, no new method, such as deep neural networks, has yet emerged as the best forecasting model. Water quality index forecasting therefore remains an open research problem, which leaves room for researchers to develop hybrid or specific methods for specific applications.

Figure 1 A three-layer ANN structure

Table 1 A decade of water quality index forecasting methods according to the referenced literature
[42] | India | (criteria not recoverable) | MLR, ANN, and ANFIS models were developed to predict the WQI of the river Yamuna in India. The results showed that the ANN and ANFIS outperformed the MLR model by 10% in the verification phase. Although the ANN performed slightly better than the ANFIS, both models outperformed the MLR model in estimating the WQI.