Comparison of supervised learning methods for prediction of monthly average flow

Long-term planning of hydrotechnical systems requires knowledge of long-term water availability, most often in the form of mean monthly flow. Knowledge from stochastic hydrology is mostly used, and possible scenarios are obtained by generating synthetic flow. The availability of climate models raises the possibility of modelling from future scenarios, and the assumption in this paper is that supervised learning can be applied for that purpose. The paper analyzes the accuracy of three supervised learning models in three approaches, and of an autoregressive model in the first approach, for predicting mean monthly flow, depending on the length of the historical time series.


Introduction
Upcoming pressures on water resources, such as population growth and the need for energy and food, demand an increase in the efficiency and effectiveness of production [1,2]. Significant climatic variations and changes cause more frequent extremely wet and dry periods and change the statistical distribution of hydrological events [3][4][5]. Stochastic methods and supervised learning methods are practical tools for simulating river flow in hydrologically studied basins. The assumption is that present and future needs related to water resources systems can be analyzed by using appropriate simulation models in water resources systems management, provided that a sufficiently long historical time series of measurements is available. For example, building a quality simulation model is necessary for conducting a simulation-optimization procedure that analyzes the availability of water for needs dependent on a water reservoir [6]. Therefore, month-by-month predictions of mean monthly river flow and long term planning are of great importance for planning and choosing the regime of a water reservoir. This paper analyzes the acceptability of using historical flow time series depending on their length and on the available data. The possibility of using an autoregressive (AR) model and three supervised learning (SL) methods for prediction on the basis of flow, and the same three methods for prediction on the basis of precipitation amount and air temperature, is tested. The objectives of the analysis are to determine the minimum length of historical time series at which it is acceptable to use the mentioned methods, and to analyze the possibility of building a quality model that could be used for predicting flow from the results of climatic models.

Overview and conclusions from previous researches
Machine learning is used for finding patterns in data and generalizing them by induction. Supervised learning is the part of machine learning and artificial intelligence that, based on given data (inputs and outputs) and an assumed hypothesis (function), searches for the hypothesis parameters that give the best predictions on unseen instances, for solving classification and regression problems. From the literature review it can be concluded that in hydrology SL is often used for real time prediction (with a time step of up to several hours) and for short term and mid term predictions (1-7 days), but rarely for long term predictions (1 month) and even less often for long term planning. A smaller time step is attractive because more data is available for model building, and it is relatively simple to build a quality model without external variables (hence, flow is predicted from flow itself). On the other hand, a model built in that way cannot reliably predict several time steps ahead of the current step (because the error accumulates as the number of time steps increases), unless time-averaged variables and external predictors (air temperature, precipitation amount, etc.) are used as input variables.
The most popular SL model is the artificial neural network (ANN), which is present in the vast majority of work in the subject area. Building a model with an ANN consists of choosing the weights in the synapses, with the objective of minimizing the differences between the desired output and the actual output of the ANN on the basis of a chosen criterion, by learning from examples [7]. In pursuit of a continuously improving description of the hydrological cycle, hydrologists have historically aspired to physically based models, which has led to the design of increasingly complex models over time [8]. The main advantages of ANN have already been noticed in the last two decades, for example avoiding the need to fully understand runoff for hydrological modelling, which is complex on a real spatial scale.
There is no need to introduce assumptions of linearity or to describe the complex relationships between different processes in detail, the use of data is more flexible, and models can be built relatively quickly [9]. Similar advantages are present in the application of other SL models. The support vector machine (SVM) is appreciated because of its generalization ability, strict theoretical basis, relatively simple usage, and robustness on regression and pattern recognition problems [7].
... the precipitation by those models, runoff was predicted from the precipitation, both by using those models and by using a runoff coefficient. ARIMA gave slightly better results, but the approach with the runoff coefficient was more accurate [15]. Terzi (2014) used GP (genetic programming) for prediction of mean monthly flow from precipitation measured at three stations and from flow measured at two stations, and compared the method with MLR [16].
In the subject area (Vinalić, Cetina), ANN was applied for short term prediction of inflow in the work by Matić (2014). Through different approaches to the use of input variables (inflow, precipitation, air temperature) and the application of ANN, the problem of the model response compared to the real event was resolved for predictions from 1 to 10 days. Time series models (prediction of inflow by inflow), rainfall-runoff models (prediction of inflow by precipitation, or by precipitation and inflow) and multivariate models (prediction of inflow from precipitation, temperature, etc.) were compared. A direct and an indirect method were used for the prediction of inflow. The direct method builds a separate model for every time step, whereas the indirect method builds a single model for all of the time steps and, because of error accumulation, is less accurate than the direct method. Time series methods were the most accurate, but the response problem had to be resolved. The accuracy was significantly increased after the following steps: introducing the precipitation frequency and accumulated precipitation, using an adaptive neural model with submodels for different seasons, optimizing the neural model, and introducing the averaged variables. Model building and calibration were done on the data from 2007 to 2011 (1862 instances), and verification was done on the data from 2012 (365 instances) [17]. As a rule, longer historical time series are used in the literature: 20-40 years [10,16], 40-60 years [11,12,15] and even about 100 years [13].
Since the quality of SL models directly depends on the amount of data used for model building (the probability of building a quality model increases with the number of instances, because the laws behind the data patterns can be generalized better), it is interesting to test what length of historical time series is needed to build a model capable of predicting outside the time domain of the historical time series with satisfactory accuracy. According to the planning timeline, there is a distinction between models for long term prediction (month-by-month) and models for long term planning, and the purpose of this work includes the development of models for both. Time series models (for one step ahead) and multivariate models (direct methods, considering that a single model learns the general laws between input and output variables) were used. The number of instances was about 110-750 (historical time series from 10 to, respectively, 65, 62 and 60 years; years without flow measurements were not counted), of which 60 % was used for building, 20 % for calibration and 20 % for verification, and the rest (from about 640 to 0 instances for 10 to, respectively, 65, 62 and 60 years) for additional verification of the models.
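The chronological splitting described above can be illustrated with a minimal Python sketch. The function name `chronological_split`, the rounding rule and the placeholder 60-year record are assumptions made for illustration; they are not taken from the authors' procedure.

```python
def chronological_split(series, used_years, months_per_year=12,
                        fractions=(0.6, 0.2, 0.2)):
    """Split a monthly series chronologically, without shuffling.

    The first `used_years` of data are divided into building, calibration
    and verification parts; everything after them is kept for additional
    verification outside the used historical time series length.
    """
    n_used = min(used_years * months_per_year, len(series))
    n_build = int(round(fractions[0] * n_used))
    n_calib = int(round(fractions[1] * n_used))
    return {
        "building": series[:n_build],
        "calibration": series[n_build:n_build + n_calib],
        "verification": series[n_build + n_calib:n_used],
        "additional_verification": series[n_used:],
    }


# Hypothetical 60-year monthly record (720 placeholder values, not real flows).
flows = list(range(720))
for years in range(60, 5, -5):  # 60, 55, ..., 10 years of used data
    parts = chronological_split(flows, years)
    print(years, {name: len(part) for name, part in parts.items()})
```

As the used length shrinks from 60 to 10 years, the part left for additional verification grows accordingly, which is how the tests with different time series lengths are organized in this sketch.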
Predictions of mean monthly flow by using SL are less frequently represented than predictions on a shorter time basis, especially for long term planning (from the used literature, works [10,14]). To the authors' knowledge, there are no works that analyze the influence of historical time series length on the accuracy of SL, although the number of instances in the data directly influences model accuracy.

Autoregressive model: Thomas-Fiering AR(1)
Stochastic processes in water resources systems management are often described by Markov processes, and for application purposes an assumption of stationarity of the historical time series is introduced. Markov processes are discretized by discrete processes, Markov chains [18]. The general form of autoregressive models AR(p) of order p is [19]:

$$z_t = \sum_{i=1}^{p} \varphi_i z_{t-i} + \varepsilon_t$$

where: $z_t$ is a time-independent, normalized and standardized series, $\varphi_i$ are autoregressive coefficients, $\varepsilon_t$ are time-independent variables. The simplest is the autoregressive process of the first order, AR(1). For normally distributed monthly flows with mean $\mu$, variance $\sigma^2$ and month-by-month correlation $\rho$, the Thomas-Fiering model AR(1) can be applied [6,18]:

$$Q_{i+1} = \mu_{j+1} + \rho_j \frac{\sigma_{j+1}}{\sigma_j} \left( Q_i - \mu_j \right) + V_i \, \sigma_{j+1} \sqrt{1 - \rho_j^2}$$

where: $Q_i$, $Q_{i+1}$ are mean monthly flows for the i-th and (i+1)-st month, $\mu_j$, $\mu_{j+1}$ are yearly averaged mean monthly flows for the j-th and (j+1)-st month, $\sigma_j$, $\sigma_{j+1}$ are standard deviations of the j-th and (j+1)-st month (yearly averaged), $\rho_j$ are correlation coefficients of the j-th and (j+1)-st month, and $V_i$ is a randomly chosen variable from the normal distribution with mean $E[V_i] = 0$ and unit variance $\mathrm{Var}[V_i] = 1$. This model is often used for synthetic flow generation and is able to preserve statistical similarity with the historical time series (e.g. [6]). The model is applied in the first approach (chapter 3), and the procedure is written in the programming environment Python (www.python.org, [20]), which is also used for all the other models.
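A minimal sketch of synthetic flow generation with the Thomas-Fiering model is given below, using only the Python standard library. The function names, the pairing of December with the following January, and the artificial 30-year record are assumptions made for illustration; the authors' own procedure is not reproduced here.

```python
import math
import random
import statistics


def _corr(a, b):
    """Pearson correlation of two equally long sequences."""
    ma, mb = statistics.mean(a), statistics.mean(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    var_a = sum((x - ma) ** 2 for x in a)
    var_b = sum((y - mb) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


def fit_thomas_fiering(flows_by_year):
    """Estimate mu_j, sigma_j and rho_j for the 12 calendar months.

    flows_by_year: list of years, each a list of 12 mean monthly flows.
    rho[j] links month j with the following month; December (j = 11)
    is paired with January of the next year.
    """
    mu = [statistics.mean([y[j] for y in flows_by_year]) for j in range(12)]
    sigma = [statistics.stdev([y[j] for y in flows_by_year]) for j in range(12)]
    rho = []
    for j in range(12):
        if j < 11:
            a = [y[j] for y in flows_by_year]
            b = [y[j + 1] for y in flows_by_year]
        else:
            a = [y[11] for y in flows_by_year[:-1]]
            b = [y[0] for y in flows_by_year[1:]]
        rho.append(_corr(a, b))
    return mu, sigma, rho


def generate_flows(mu, sigma, rho, n_years, q_december, rng):
    """Generate n_years * 12 synthetic flows, starting in January.

    q_december is the flow of the December preceding the generated series.
    No correction for negative values is applied in this sketch.
    """
    flows, q = [], q_december
    for step in range(n_years * 12):
        nxt = step % 12          # month being generated (j + 1)
        cur = (nxt - 1) % 12     # preceding month (j)
        v = rng.gauss(0.0, 1.0)  # V_i: zero mean, unit variance
        q = (mu[nxt]
             + rho[cur] * sigma[nxt] / sigma[cur] * (q - mu[cur])
             + v * sigma[nxt] * math.sqrt(max(0.0, 1.0 - rho[cur] ** 2)))
        flows.append(q)
    return flows


# Hypothetical 30-year record: a seasonal cycle plus noise (not measured data).
rng = random.Random(0)
history = [[10 + 5 * math.cos(2 * math.pi * m / 12) + rng.gauss(0, 2)
            for m in range(12)] for _ in range(30)]
mu, sigma, rho = fit_thomas_fiering(history)
synthetic = generate_flows(mu, sigma, rho, n_years=10,
                           q_december=history[-1][11], rng=rng)
print(len(synthetic), round(statistics.mean(synthetic), 2))
```

Because each generated month depends only on the preceding month and on a random term, repeated runs with different seeds give different but statistically similar scenarios.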

Artificial neural networks
An ANN mimics the learning principle used in the brain, under the assumption that learning takes place through electrochemical activity in networks consisting of neurons [21].
Most often an ANN consists of three layers: the first one, whose nodes are in fact the input variables, a hidden layer consisting of nodes with an activation function, and a layer with the output node, the predicted value (e.g. flow). Because the number of hidden layers and nodes can be varied, finding the appropriate ANN architecture is a complex task [21,22]. The multilayer perceptron type, with three layers, for solving the regression problem, is used in the paper. Differences between real and modelled values are minimized by a stochastic optimization algorithm based on first order gradients (Adaptive Moment Estimation, ADAM). The algorithm is computationally efficient, does not demand much memory and is suitable for large amounts of data [20,23]. The parameters which most significantly affect the quality of model building are the number of hidden layers and the number of nodes in a layer, the activation function, the learning rate and learning momentum, the maximum number of iterations in error optimization, and the error tolerance. There are also some other parameters, but generally the choice of the input variables is the key step in applying SL. In an ANN with three layers, the activation $a_j^{(2)}$ in the node j = 1, 2, …, l of the hidden layer (label 2) is calculated in the following way [24, 25]:

$$a_j^{(2)} = g\left( \sum_{i=0}^{k} \theta_{j,i}^{(1)} x_i \right)$$

where: $g$ is the activation function and $\theta_{j,i}^{(1)}$ is the weighted influence of the input variable $x_i$ on the activation $a_j^{(2)}$, i = 1, 2, …, k. Index k refers to the number of nodes in the first layer, index l refers to the number of nodes in the hidden layer, and index 0 refers to the "bias" variable. The predicted value is calculated by the equation:

$$\hat{y} = \sum_{j=0}^{l} \theta_{j}^{(2)} a_j^{(2)} \qquad (3)$$

where $\theta_{j}^{(2)}$ is the weighted influence of the activation $a_j^{(2)}$ on the output node.
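The forward calculation of a three-layer perceptron for regression can be sketched as follows. The sketch covers only the prediction step, not the training with ADAM; the function names, the choice of the rectifier as activation, the linear output node and all weight values are hypothetical and serve only as an illustration.

```python
def relu(z):
    """Rectifier activation function."""
    return max(0.0, z)


def forward(x, theta1, theta2, g=relu):
    """Forward pass of a three-layer perceptron for regression.

    x      : k input values
    theta1 : l rows of k + 1 weights (index 0 is the bias weight)
    theta2 : l + 1 output weights (index 0 is the bias weight)
    """
    x = [1.0] + list(x)  # x_0 = 1 is the bias variable
    hidden = [g(sum(w * xi for w, xi in zip(row, x))) for row in theta1]
    hidden = [1.0] + hidden  # bias unit of the hidden layer
    # Linear output node (an assumption of this sketch).
    return sum(w * a for w, a in zip(theta2, hidden))


# Hypothetical weights for k = 2 inputs and l = 2 hidden nodes.
theta1 = [[0.5, 1.0, 1.0],
          [-1.0, 0.5, 2.0]]
theta2 = [0.1, 2.0, 3.0]
print(forward([2.0, -1.0], theta1, theta2))
```

In this example the first hidden node is active (its weighted sum is 1.5) and the second is switched off by the rectifier (its weighted sum is -2.0), so only the first contributes to the output besides the bias.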

Support vector machine
In the classification problem, SVM finds the parameters of a chosen function with which the function is optimally distanced from the different classes, while in the regression problem the procedure is used for finding the optimal way of describing the data with the chosen function. The problem is often multidimensional (it can be seen in chapter 3 that the mean monthly flow is described as a function of at least 6 different predictors) and difficult to represent graphically. SVM considers data as support vectors and approximates them with the given hypothesis by minimizing the approximation error of the predicted value. Within the defined margin, that is, error ε, there must be as many points as possible. The trade-off between bias and the tolerance of deviations greater than the error is controlled by the parameter C, a positive constant which determines the degree of error penalization. Bias and variance are balanced through the minimization of the sum of the regularization part and the model building error in equation (4) [26,27]:

$$\min_{w, b, \xi, \xi^*} \; \frac{1}{2} \lVert w \rVert^2 + C \sum_{j=1}^{n} \left( \xi_j + \xi_j^* \right) \qquad (4)$$

with conditions:

$$y_j - \left( w \cdot x_j + b \right) \le \varepsilon + \xi_j, \qquad \left( w \cdot x_j + b \right) - y_j \le \varepsilon + \xi_j^*, \qquad \xi_j, \xi_j^* \ge 0$$

where: $x_j$ are input variables, $y_j$ is the predicted variable, $w$ is a vector from the space of input variables, $b$ is the bias variable, and $\xi_j$, $\xi_j^*$ are slack variables used for estimating the deviation of input variables from the margin. The hypothesis used for approximating the predicted variable is [28]:

$$f(x) = \sum_{i=1}^{n} \left( \alpha_i - \alpha_i^* \right) K(x_i, x) + b$$

where: $\alpha_i$, $\alpha_i^*$ are variables resulting from the transition to the dual optimization problem, and $K$ is the label for the kernel. The programming environment enables the choice of the kernel function and its parameters (linear, polynomial with its degree, radial basis function), and of the parameter C, all of which affect the accuracy of prediction.
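The roles of the kernel and of the error ε can be illustrated with a short sketch. It evaluates the kernel form of the hypothesis for given coefficients and computes the ε-insensitive deviations; it does not solve the dual optimization problem. The function names and all numerical values (support vectors, coefficients, `gamma`, `epsilon`) are hypothetical.

```python
import math


def rbf_kernel(u, v, gamma):
    """Radial basis function kernel K(u, v) = exp(-gamma * ||u - v||^2)."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(u, v)))


def svr_predict(x, support_vectors, dual_coefs, b, gamma):
    """Kernel expansion f(x) = sum_i coef_i * K(x_i, x) + b.

    Each coef_i stands for the difference of the two dual variables
    belonging to support vector x_i.
    """
    return b + sum(c * rbf_kernel(sv, x, gamma)
                   for sv, c in zip(support_vectors, dual_coefs))


def epsilon_insensitive(y_true, y_pred, epsilon):
    """Deviations beyond the margin epsilon; points inside it cost nothing."""
    return [max(0.0, abs(t - p) - epsilon) for t, p in zip(y_true, y_pred)]


# Hypothetical support vectors and coefficients (not fitted to any data).
support_vectors = [[0.0, 0.0], [1.0, 1.0]]
dual_coefs = [2.0, -1.0]
print(svr_predict([0.0, 0.0], support_vectors, dual_coefs, b=0.5, gamma=0.5))
print(epsilon_insensitive([1.0, 2.0, 3.0], [1.05, 2.5, 2.0], epsilon=0.1))
```

Only the second and third points in the last line lie outside the margin, so only they would be penalized, in proportion to C, in the objective being minimized.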

Nearest neighbours method
NNM uses the principle of searching for the set of values (in the part of data for model building) which are most similar to the given ones (in the part of data for model prediction). For that purpose it is necessary to find the distances between the given points and the most similar points (k nearest neighbours). Too small a number of neighbours makes the model more sensitive, while too large a number reduces accuracy because of the influence of distant neighbours.
After the nearest neighbours are found, NNM calculates the mean of the predicted values of the individual neighbours [25,27]. The choice of distance measure (Euclidean, Minkowski, etc.) did not have a significant influence on the accuracy of the results in the paper. Besides the number of neighbours, the weights of the neighbours' influences (uniform or dependent on distance) significantly affect the model accuracy. The programming environment enables a choice between four algorithms for searching for the nearest neighbours: ball tree, kd tree, brute force, and an automatic choice of the best of those three. They are important because the calculation of distances between neighbours is computationally demanding. Brute force searches through all possible options, which can take long for a large number of instances, while the other two use tree logic for searching. Kd tree is a binary tree which avoids calculating the distances for points which are known to be distant (if point A is far from point B, and point C is close to B, then C is far from A). This algorithm is not efficient when D-dimensional distance measures are used with D > 20 (the number of predictors is > 20). The problem is solved by the ball tree algorithm which, instead of partitioning the data along Cartesian axes, partitions it into nested hyperspheres [27,29].

This is important to consider in every statistical analysis, and also when applying SL, because SL relies on the principle of learning from data. SL can capture changes in natural flows that arise from construction if a sufficient number of instances is available for model building. It can be assumed that inflows from HS Vinalić 1 can be used for long term analysis of water availability.
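The nearest neighbours method described in this section can be sketched with a brute-force search in a few lines. The function name `knn_predict`, the Euclidean distance and the small one-predictor data set are assumptions made for illustration; tree-based search is not shown.

```python
import math


def knn_predict(x, train_X, train_y, k, weights="uniform"):
    """Predict by averaging the targets of the k nearest training points.

    weights="uniform"  : plain mean of the k neighbours
    weights="distance" : mean weighted by 1 / distance
    """
    nearest = sorted(
        (math.dist(x, xi), yi) for xi, yi in zip(train_X, train_y)
    )[:k]
    if weights == "uniform":
        return sum(y for _, y in nearest) / len(nearest)
    # A query that coincides with training points takes their mean target.
    exact = [y for d, y in nearest if d == 0.0]
    if exact:
        return sum(exact) / len(exact)
    w = [1.0 / d for d, _ in nearest]
    return sum(wi * y for wi, (_, y) in zip(w, nearest)) / sum(w)


# Hypothetical training data: one predictor, one target.
train_X = [[0.0], [1.0], [2.0], [4.0]]
train_y = [10.0, 12.0, 20.0, 40.0]
print(knn_predict([1.25], train_X, train_y, k=2, weights="uniform"))
print(knn_predict([1.25], train_X, train_y, k=2, weights="distance"))
print(knn_predict([1.0], train_X, train_y, k=2, weights="distance"))
```

With distance weights, a query that coincides with a point used for model building returns exactly that point's target, which is one reason why distance-weighted models can reproduce the building data very closely without generalizing equally well.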

Forming of the models
The first step is the choice of input variables, that is, predictors. Those are the variables from which mean monthly flow is predicted, and three different approaches are used in the work: prediction of flow by using flow, by using precipitation and temperature from one station (MMS Knin) ...
... calibration-verification were used for additional model verification. Therefore, in the second test, 5, 2 and 5 years were left for additional model verification, and in the last test, 55, 52 and 50 years (tables 4-6).

Statistical error measures
While optimizing the models, most attention was paid to achieving as high a correlation as possible, as small a root mean squared error as possible and as high a coefficient of determination as possible. The correlation coefficient R represents the interconnection between the measured and the predicted variable. The range 0-0.25 refers to weak, 0.25-0.6 to medium strong, and 0.6-1.0 to strong correlation [32]. High values of the correlation coefficient do not necessarily mean that the built model is capable of generalizing well. Therefore, other error measures were also used: root mean squared error (RMSE), mean absolute error (MAE), relative absolute error (RAE), root relative squared error (RRSE), and coefficient of determination or efficiency (R²). Due to the limited space in the paper, only R² and RMSE are shown. The R² used here is a measure of how well values unseen by the model are likely to be predicted; it is not necessarily the squared value of R (there are several definitions) and can be negative if the model predicts arbitrarily badly. The value 1.0 represents an absolutely accurate prediction [24]. Equations of the mentioned measures can be found in research from the area (e.g. [10,13,24,33]).
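The difference between the correlation coefficient and an R² that can become negative can be shown on a small example. The sketch below assumes the efficiency form R² = 1 - SS_res / SS_tot, which is one common definition and is taken here as an assumption rather than as the exact formula of the paper; the observed and predicted values are invented.

```python
import math


def rmse(obs, pred):
    """Root mean squared error."""
    return math.sqrt(sum((o - p) ** 2 for o, p in zip(obs, pred)) / len(obs))


def mae(obs, pred):
    """Mean absolute error."""
    return sum(abs(o - p) for o, p in zip(obs, pred)) / len(obs)


def r2_efficiency(obs, pred):
    """Coefficient of determination as 1 - SS_res / SS_tot (assumed form)."""
    mean_obs = sum(obs) / len(obs)
    ss_res = sum((o - p) ** 2 for o, p in zip(obs, pred))
    ss_tot = sum((o - mean_obs) ** 2 for o in obs)
    return 1.0 - ss_res / ss_tot


def pearson_r(obs, pred):
    """Correlation coefficient R between observed and predicted values."""
    mo, mp = sum(obs) / len(obs), sum(pred) / len(pred)
    cov = sum((o - mo) * (p - mp) for o, p in zip(obs, pred))
    so = math.sqrt(sum((o - mo) ** 2 for o in obs))
    sp = math.sqrt(sum((p - mp) ** 2 for p in pred))
    return cov / (so * sp)


# Hypothetical observed flows and two sets of predictions.
obs = [2.0, 4.0, 6.0, 8.0]
good = [2.5, 3.5, 6.5, 7.5]
reversed_pred = [8.0, 6.0, 4.0, 2.0]

print(rmse(obs, good), r2_efficiency(obs, good), pearson_r(obs, good))
print(r2_efficiency(obs, reversed_pred), pearson_r(obs, reversed_pred) ** 2)
```

For the reversed predictions the squared correlation equals 1.0 although the efficiency is strongly negative, which illustrates why a high correlation alone does not confirm that a model predicts well.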

The first approach
In the first approach it was shown that AR(1) ... 3) the analysis of all historical time series of different lengths was conducted.
When building the models, historical time series were always split chronologically: the first 60 % of the data for model building, the next 20 % for model calibration and the last 20 % for model verification. In the first test the maximum amount of available data was used for all approaches: respectively, 65, 62 and 60 years. In the second test, data from the last years were removed so that, respectively, 60, 60 and 55 years were used. In each further test the last 5 years were removed until 10 years of data were left. Data from those years which were not used for the building-calibration-verification procedure were used for additional model verification.
Analysis of the results shows that R² in the building and calibration part is satisfactory only for the AR(1) model (R² > 0.65), while for the other models it is in the range of medium strength (R² < 0.45), with the exception of slightly greater values for the NNM model in the building part (R² < 0.55). Verification for most of the models is in the lower range of medium strength of the coefficient of determination (R² < 0.50), as is the verification outside the used historical time series length for all of the models. Based on this, it can be concluded that this approach is not recommended, except with the possible introduction of improvements by building hybrid models, for example by using singular spectrum analysis [33]. Since the emphasis in this work is placed on long term planning, other approaches need to be used. The coefficient of determination and the root mean squared error of the AR and SL models are given in table 4.
Figure 6 gives a graphical representation of the measure R² in dependence of historical time series length for all the approaches. For AR(1), care should be paid to the parameter t, which can be seen in the verification parts. According to the idea of the paper, models should also be applied for long term planning, so it is recommended to pay attention to the error measures in the verification parts. AR(1) does not use external predictors and is not applied in the other two approaches. Among the SL models, NNM describes flows more accurately in the model building part, but the accuracy is not preserved in the calibration and verification parts. In the second and the third approach, and on some occasions in the first approach, ANN and NNM give greater accuracy in model building than SVM. However, with SVM the accuracy is preserved in the calibration and verification parts. The most favourable combinations of error measures (greatest values of R² and lowest values of RMSE) are obtained by the SVM model, for every time series length. In the second and the third approach SVM also shows the lowest variability of error measures in dependence of time series length. ANN has a great number of options in the choice of parameters and network architecture, and it is possible that greater accuracy would be achieved by exhaustive research, which can be time-consuming.

The second approach
In the second approach the input variables were changed and the models had different parameters than in the first approach. With NNM, by applying the weight distribution that depends on the distance between "the neighbours", the flows used for model building are described accurately, while accuracy is reduced for calibration and verification. This is worth noting, because a model that very closely approximates the building data will not necessarily generalize well to other data. However, in this case overfitting did not occur, because equal accuracy was achieved in calibration and verification (but not in model building) with the equal weight distribution. The radial basis function kernel gave the highest accuracy of SVM. The correlation of all models, in the model verification and in the verification outside the used historical time series length, is in the area of strong correlation (R² > 0.44), except for one NNM model (R² = 0.26) that refers to the shortest set of 2-year predictions. It should be emphasized that for time series lengths from 45 to 55 years R² is higher for the verification outside the historical time series length than for calibration and verification, for all analyzed models. Extreme values are overestimated or underestimated for most of the verification (high RMSE, MAE, RRSE, RAE). On the other hand, RMSE values are in the range of 4.9-7.05 m³/s, indicating a significant increase in accuracy compared to the first approach, where they were in the range of 6.6-12.09 m³/s. The ability of all models to globally describe the nature of the flow is high, even outside the used historical time series length. For models with smaller historical time series lengths, not

The third approach
In the third approach the model precision increased according to all statistical measures. For ANN, the rectifier (rectified linear unit) function was shown to be the best activation function. The NNM model behaves similarly as in the second approach. Also, ANN for 10-20 years produced perfect accuracy in model building, but significantly reduced accuracy in the other parts. In the case of ANN, this is a result of overfitting, and as the number of instances for model building is reduced, additional effort should be invested in finding the appropriate network architecture. The SVM is the most accurate and shows the ability to keep error rates low on all parts of the data. Correlation on the verification for SVM outside the time series length is in ... in dependence of time series length for the third approach is given in Figure 10. Additional inclusion of data from nearby stations would certainly increase the precision of the model. However, the goal of machine learning is to build a good model with as few input variables as possible, mimicking realistic situations where a significant number of nearby stations is rarely available. Including the number of days in a month with a certain amount of precipitation can also contribute to the precision. After determining the best configuration of the model, improvements can be made by spectral analysis, wavelet based methods, chaos analysis, phase space reconstruction, etc. (see [12,13,29]).

Statistical analysis of results
Descriptive statistics of the model results are calculated and given in table 7. In the first approach a significant underestimation of the maximum values by the ANN, SVM and NNM models is noticeable, while AR(1) shows an overall minor deviation and a higher average flow rate. In the second and the third approach deviations are significantly reduced. In the second approach ANN shows the smallest deviations, while in the third approach SVM shows the smallest deviations. Attention should also be paid to the possible occurrence of negative flow values in ANN and SVM, although they are negligible in the most accurate model (SVM, third approach). For future research, solving these problems by optimizing model parameters is suggested.
A scatter plot of modelled values against observed values for the third approach is shown in Figure 11. Values from all parts of the data for the ANN, SVM and NNM models are presented separately with unique markers. The largest scatter is observed for the NNM model, except for the model building part, whose values align completely with the line of perfect agreement.
As the measured flow rate increases, NNM significantly underestimates the predicted values. Less dispersion can be seen for the SVM and ANN models, but these models also tend to underestimate higher flow rates. Therefore, for future research, it is recommended to calculate the model's reliability intervals and to integrate these values into the model results, for example by using quantile regression (e.g. [34]).

Conclusion
The paper analyzes the possibility of predicting the mean monthly flow for the purpose of long-term prediction and planning in solving problems related to water availability. Three different approaches were used in which three SL methods were compared, with the addition of the stochastic method in the first approach. In the first approach, SL was able to describe the general nature of the flow, but with significant deviations in the area of extreme values. AR is capable of replicating the full variability of the flow if proper attention is paid to the methods used for quantifying flow variability. In order to use SL, more complex models and/or more informative input data must be selected. In the second and the third approach the causality between input and predicted variables was described better with SL. The use of precipitation and temperature for flow forecasting is favourable because projections from climatic models can be used, which cannot be done in the first approach. It is generally valid that a larger amount of data used to build an SL model yields greater accuracy and precision. However, the length of the time series does not necessarily reflect the quality of the built model. The precision of the SVM, with the determination coefficient in the range of 0.7-0.8 for time series lengths of 20-40 years, is satisfactory, whereas for 10 years (determination coefficient 0.67) the precision is not significantly lower. The recommendation for further research is to focus on additional elaboration of the input variable selection methodology so that the available data is used more efficiently.

Figure 1. Structure of ANN with three layers

Research area
Methodology and models were applied to the flow measurements of the river Cetina from hydrological station Vinalić 1 (HS Vinalić 1). A historical time series of daily flows from 1946 until 2015 was available, with a gap in measurements from 1991 until 1997 [30]. In principle, those flows can be understood as inflows into the water reservoir Peruća, but with a dose of caution because the area is karstic. Also available are: accumulated daily precipitation (amount of fallen rainfall) and mean daily air temperature (hereafter temperature) from main meteorological station Knin (MMS Knin, 250 m a. s. l.) in the period from 1949 until 2015, accumulated daily precipitation from precipitation station Vinalić (PS Vinalić, 350 m a. s. l.) in the period from 1951 until 2015 (gap in measurements 1991-1997), and mean monthly temperature from climatological station Sinj (CS Sinj, 308 m a. s. l.) in the period from 1949 until 2015 [31]. The overview map and the position of the stations can be seen in figure 3, while mean, minimum, maximum and averaged monthly flow can be seen in figure 4. The forming of the water reservoir did not have a significant influence on the flows at HS Vinalić 1.

For all approaches, modelled and observed flows for the historical time series length of 45 years are shown (figures 5, 6 and 7), due to the satisfactory accuracy (in the second and the third approach) and the possibility of long term planning up to 15 years. Instances where data is missing are represented by the value 0.

Figure 5. a) building, b) calibration, c) verification and d) verification outside the historical time series length for historical time series length of 45 years, the first approach

Figure 6. R² on all parts of data in dependence of the used historical time series length, the first approach

Figure 7. a) building, b) calibration, c) verification and d) verification outside the historical time series length for historical time series length of 45 years, the second approach

Figure 8. R² on all parts of data in dependence of the used historical time series length, the second approach

Figure 10. R² on all parts of data in dependence of the used historical time series length, the third approach

Table 1. Input variables used in analysis
Table 2. Statistics of used characteristic quantities at all configurations. Flows (Vinalić 1) (1946-2015)
The second and the third approach use exclusively external variables and are appropriate for long term planning. The quantities in table 1 are defined as input data. The characteristic quantities shown in table 2 represent the variables which were available for selecting the model configuration. For example, in the first configuration one of the potential input variables is Q_avmmin, the minimum mean monthly flow averaged over all years. Of all of the monthly values, its minimum for the period 1946-2015 is 0.56 m³/s, its mean value is 2.70 m³/s, and its maximum value is 5.51 m³/s, as shown in table 2. Analogously, this also applies to the other physical quantities in the second and the third configuration. The procedure for processing and preparing the data for model building is written in the programming environment. For every approach, the correlation of the potential input variables (table 1) with mean monthly flow is analyzed. In the preliminary choice of input variables only variables with a correlation of at least 0.55-0.60 were used. For yearly averaged variables, the correlation with mean monthly flows for each particular year was considered, and to be used in the preliminary choice a variable needed to satisfy the threshold in at least 30-40 % of the historical time series. The result of the procedure is the time series for the modelling procedure. The second step is preliminary building of the models. With the obtained time series, the capability of the models AR, ANN, SVM and NNM to approximate flow is tested. Model parameters are preliminary