Issues of accounting for outliers in assessing nutrient runoff

This study analyzes the highest values of nutrient concentrations in the runoff of the Velikaya River for the period 1969-2009. The results show that outliers must be excluded from the observation series when the numerical characteristics of river water pollution are assessed. Including outliers in the calculation together with the main observational data leads to a significant overestimation of the numerical characteristics of river pollution over annual and multi-year averaging periods. At the same time, the characteristics of the outliers themselves need to be studied separately from the initial samples. The paper presents an attempt to estimate the probability of outliers by combining outlier information from individual observation series.


Introduction
Studies of river pollution usually focus on the analysis of averaged values of indicators of the ecological state of the environment. Possible deviations of the indicator values from the average are then either ignored or insufficiently taken into account [1]. Yet these deviations sometimes determine the possible extreme values and should be taken into account when solving practical environmental tasks.
Thus, the study of methods for identifying extreme values, the so-called "outliers" [1], and for assessing their significance is an important task in the practice of statistical analysis. Accordingly, the main goal of this research is to analyze the outliers in the series of nutrient concentrations in the Velikaya River.

Materials and methods
The source of the Velikaya River is located near the village of Shepeli; the river flows into the Pskov Lake 4 km west of the village of Murovitsy. The river is 430 km long, and its catchment area is 25,200 km² [2]. Three stationary observation points are installed on the Velikaya River, near the towns of Opochka, Ostrov, and Pskov. Each observation point has two cross-sections: above (upstream cross-section) and below (downstream cross-section) the city line. The observation period under consideration is 1969-2009.
We used long-term observation data on the concentrations of nitrate nitrogen (N-NO₃⁻), ammonium nitrogen (N-NH₄⁺), and total iron (Fe total), together with the indicator of biological oxygen demand over 5 days (BOD5). The data were provided by the Northwest Department of Hydrometeorology and Environmental Monitoring.
The study used methods of preliminary statistical data analysis, including descriptive analysis of time series, parametric methods, and visual and computational methods of outlier detection.

Results and discussions
The first stage of the study is the analysis of the distribution parameters of the initial data series. Table 1 shows the estimates of the numerical characteristics for all initial data series. The results in Table 1 reveal some features of the concentration distribution along the length of the Velikaya River. For example, the largest increase in the skewness coefficient (g) is most often observed within the urban area. Moreover, while the estimates of the mean (m) and the coefficient of variation (Cv) vary only slightly across sections and elements, the values of g often increase more than threefold from section to section. Such large changes in g with small changes in m and Cv most often occur when the series of observations contains extreme values that deviate strongly from all other members of the series, the so-called outliers. By definition, outliers are observations considerably higher or lower than most of the data, which occur infrequently but regularly [3].
As noted earlier [4], outliers are caused by emergency discharges from enterprises or by extreme weather. Thus, the genesis of outliers differs from that of the main observational data.
To confirm the presence of outliers in the initial data series, it is necessary to determine the distribution laws of those series. The normal law, the Pearson type III law, the Kritsky-Menkel law, the log-normal law, and the Gumbel law were considered as well-known theoretical distribution laws used in statistical calculations.
The agreement between the empirical probability curves (whose coordinates are calculated directly from the initial series of observations) and the theoretical probability curves was verified with the Cramér-von Mises-Smirnov test (statistic nω²). According to [5], the upper critical value of the statistic nω² at the 5% significance level is 0.461.
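A minimal Python sketch of how the nω² statistic can be computed for a single series is given here. The function names, the sample values, and the use of a normal CDF as the fitted law are illustrative assumptions and not the authors' implementation; in the study the fitted laws were Pearson type III and Kritsky-Menkel. Note also that the tabulated critical value strictly applies to a fully specified theoretical law; when the parameters are estimated from the same sample, the test tends to be conservative.

```python
import math

def n_omega_squared(sample, cdf):
    """Cramer-von Mises statistic n*omega^2 of `sample` against a theoretical CDF."""
    xs = sorted(sample)
    n = len(xs)
    total = 1.0 / (12.0 * n)
    for i, x in enumerate(xs, start=1):
        # Squared gap between the theoretical CDF and the plotting position
        total += (cdf(x) - (2.0 * i - 1.0) / (2.0 * n)) ** 2
    return total

def normal_cdf(mean, std):
    """Normal CDF, used here only as a stand-in for the fitted law (assumption)."""
    return lambda x: 0.5 * (1.0 + math.erf((x - mean) / (std * math.sqrt(2.0))))

CRITICAL_5_PERCENT = 0.461  # critical value quoted in the text for the 5% level

# Hypothetical concentrations in mg/dm^3, not data from the study
sample = [0.21, 0.25, 0.30, 0.33, 0.36, 0.40, 0.44, 0.52, 0.61, 0.75]
mean = sum(sample) / len(sample)
std = math.sqrt(sum((x - mean) ** 2 for x in sample) / (len(sample) - 1))

stat = n_omega_squared(sample, normal_cdf(mean, std))
fit_accepted = stat < CRITICAL_5_PERCENT
print(f"n*omega^2 = {stat:.3f}, fit accepted: {fit_accepted}")
```

Passing the CDF as a callable keeps the statistic independent of the chosen law, so the same function could be reused with any of the candidate distributions listed above.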
According to the results, the Pearson type III law (88% of the cases considered) and the Kritsky-Menkel law (12% of cases) are optimal in this case.
A visual comparison of the selected theoretical and empirical probability curves showed that some series contain empirical points lying considerably higher than most of the data. In total, 28 such values were found across all series of observations. These concentration values, which deviate from the probability curves, are candidate outliers. This assumption was verified by testing statistical hypotheses with the Dixon and Smirnov-Grubbs criteria.
The analysis showed that for 14 of the 28 values the hypothesis that the value belongs to the initial series was rejected. These values are therefore outliers. In total, 4 outliers were detected in the series of nitrate nitrogen (N-NO₃⁻), 2 in the series of total iron (Fe total), and 8 in the series of BOD5.
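The two criteria can be sketched in Python as follows. This is a simplified illustration under stated assumptions: only the largest value of a series is tested, the Dixon ratio is the simplest gap-to-range form, the sample is hypothetical, and the critical values are approximate placeholders that should be replaced with tabulated values for the actual sample size and significance level.

```python
import math

def grubbs_statistic(sample):
    """G = (x_max - mean) / s for the largest value of the sample."""
    n = len(sample)
    mean = sum(sample) / n
    s = math.sqrt(sum((x - mean) ** 2 for x in sample) / (n - 1))
    return (max(sample) - mean) / s

def dixon_statistic(sample):
    """Q = (x_n - x_(n-1)) / (x_n - x_1) for the largest value of the sample."""
    xs = sorted(sample)
    return (xs[-1] - xs[-2]) / (xs[-1] - xs[0])

def is_outlier(statistic, critical_value):
    """Reject the hypothesis that the value belongs to the series."""
    return statistic > critical_value

# Hypothetical concentrations in mg/dm^3 with one suspicious value
sample = [0.21, 0.25, 0.30, 0.33, 0.36, 0.40, 0.44, 0.52, 0.61, 1.90]

# Placeholder critical values (assumption); take exact ones from tables
G_CRITICAL = 2.18
Q_CRITICAL = 0.47

g_stat = grubbs_statistic(sample)
q_stat = dixon_statistic(sample)
print(f"G = {g_stat:.2f}, outlier by Grubbs: {is_outlier(g_stat, G_CRITICAL)}")
print(f"Q = {q_stat:.2f}, outlier by Dixon: {is_outlier(q_stat, Q_CRITICAL)}")
```

Requiring both criteria to reject, as a conservative reading of the procedure described above, would be one way to combine them; the text does not state how disagreements between the two tests were resolved.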
Thus, each initial series containing anomalous values is divided into two groups of observations. The first group determines the choice of the theoretical distribution law and reflects the regular regime of watercourse pollution, which is permissible under the planned operation of enterprises. It comprises most of the values of the initial series. The second group (outliers) does not follow the estimated distribution laws and most likely reflects the pollution regime during emergency discharges or extreme weather [1].
In most cases, outliers last no more than a few days. However, when average annual concentrations or volumes of pollutant runoff are calculated, every measurement, including each outlier, is given the same weight. For example, with 12 concentration measurements per year, each measurement receives a weight of 30 days in the average annual concentration. This leads to an unjustified overestimation of the numerical characteristics of the concentrations. For this reason, the concentrations identified as outliers are excluded from the calculation of the numerical characteristics of the series (Table 2).
As the presented data show, excluding outliers from the calculation decreased the average annual concentrations by 5-10%. The measures of spread and skewness changed even more. For example, the coefficient of variation (Cv) decreased by 10-20%, and the skewness coefficient (g) decreased in some cases by more than a factor of three. Consequently, the estimated extreme values of nutrient concentrations under the normal planned operation of enterprises are significantly lower. For example, the concentrations with an exceedance probability of 1% (C1%) decrease on average by 5-20% when the outliers are excluded. Including outliers, which do not belong to these samples, in the general statistical analysis can significantly distort the results of the study and lead to an artificial overestimation of the characteristics of nutrient runoff.
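The effect of a single outlier on the mean, Cv, and g can be illustrated with a short Python sketch. The twelve monthly values are hypothetical, and the skewness estimator (the adjusted Fisher-Pearson form) is an assumption, since the text does not state which estimator was used.

```python
import math

def characteristics(sample):
    """Return the mean m, coefficient of variation Cv, and skewness g."""
    n = len(sample)
    m = sum(sample) / n
    s = math.sqrt(sum((x - m) ** 2 for x in sample) / (n - 1))
    # Adjusted Fisher-Pearson sample skewness (assumed estimator)
    g = n / ((n - 1) * (n - 2)) * sum(((x - m) / s) ** 3 for x in sample)
    return m, s / m, g

# Hypothetical monthly concentrations in mg/dm^3; the last value is an outlier
with_outlier = [0.30, 0.34, 0.28, 0.41, 0.36, 0.33,
                0.38, 0.31, 0.35, 0.29, 0.37, 1.20]
without_outlier = with_outlier[:-1]

m_all, cv_all, g_all = characteristics(with_outlier)
m_main, cv_main, g_main = characteristics(without_outlier)

print(f"with outlier:    m = {m_all:.3f}, Cv = {cv_all:.2f}, g = {g_all:.2f}")
print(f"without outlier: m = {m_main:.3f}, Cv = {cv_main:.2f}, g = {g_main:.2f}")
```

In this toy series one short-lived event carries one twelfth of the annual weight, which is why the mean shifts noticeably while the skewness changes far more.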
On the other hand, as noted earlier, anomalous values (outliers) characterize the pollution regime of the watercourse during possible emergencies or extreme weather conditions. It is both interesting and practically important to assess the variability of pollutant concentrations under these conditions. To assess the possible values of outliers, it is necessary to find the law of their distribution. A theoretical curve of outliers cannot be determined for each series of observations separately because the empirical material is insufficient. Therefore, all outliers for different elements and different data series were combined into one sample. It is further assumed that the normalized initial series and the time series of outliers are stationary and ergodic. The need to take these features of the initial series of observations of the hydrochemical regime of rivers into account is theoretically justified [6,7].
The combined outliers belong to different series of observations, each with its own numerical characteristics. Therefore, the concentration values must be normalized by the mean value and standard deviation of their series:
$$t_{ji} = \frac{x_{ji} - m_j}{\sigma_j} \qquad (1)$$
where $j$ is the number of the series of observations, $i$ is the number of the outlier in this data series, $x_{ji}$ is the concentration of the $i$th outlier in the $j$th data series, and $m_j$ and $\sigma_j$ are the mean value and standard deviation of the $j$th data series, respectively.
The values of the numerical characteristics of the resulting series are shown in Table 3. The assessment showed that the optimal type of theoretical probability curve is the same as for the initial series. Therefore, the Pearson type III distribution law can be considered optimal for the combined sample. Table 4 shows the coordinates of the probability curves for the combined series of outliers in normalized ordinates ($t_p$) and modular coefficients ($K_p$).
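The normalization by the mean and standard deviation of each series, followed by pooling, can be sketched in Python as shown here. The series names and values are hypothetical, and computing the mean and standard deviation from the main data without the outliers is an assumption of this sketch; the text does not specify which subset was used.

```python
import math

def normalize_outliers(series, outliers):
    """Normalize outliers of one series: t = (x - m_j) / sigma_j."""
    n = len(series)
    m_j = sum(series) / n
    sigma_j = math.sqrt(sum((x - m_j) ** 2 for x in series) / (n - 1))
    return [(x - m_j) / sigma_j for x in outliers]

# Hypothetical data: (main observations, detected outliers) per series
observations = {
    "series_a": ([0.30, 0.34, 0.28, 0.41, 0.36, 0.33], [1.20]),
    "series_b": ([2.1, 2.4, 1.9, 2.6, 2.2, 2.0], [6.5, 7.1]),
}

# Pool the normalized outliers of all series into one combined sample
pooled = []
for name, (series, outliers) in observations.items():
    pooled.extend(normalize_outliers(series, outliers))
pooled.sort()

print([round(t, 2) for t in pooled])
```

Because the pooled values are dimensionless, outliers of different elements and units become comparable, which is what makes a single distribution fit to the combined sample possible.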
The outlier concentration of a given probability in each series follows from inverting the normalization (1):
$$x_{jp} = m_j + \sigma_j t_{jp}$$
where $x_{jp}$ is the outlier concentration of a given probability P, $m_j$ is the mean value, $\sigma_j$ is the standard deviation, and $t_{jp}$ is the normalized ordinate of the given probability. The results of this calculation are shown in Table 5. These values are the concentrations that the random variable would exceed with a fixed probability if an outlier occurred. For example, at the Ostrov observation point (downstream cross-section), a nitrate nitrogen concentration of 0.92 mg/dm³ can be exceeded with a probability of 0.1%. Such an event is thus highly unlikely.
To calculate the exceedance probability at a particular time, it is necessary to take into account the probability of the outliers themselves:
$$P_\beta = \frac{m_i}{n_i}$$
where $m_i$ and $n_i$ are the number of outliers and the total number of observations in the $i$th observation series, respectively.
Thus, the exceedance probability is defined as the product of the exceedance probability given that an outlier has occurred ($P_\alpha$) and the frequency of outliers of the $i$th element in the river ($P_\beta$):
$$P = P_\alpha P_\beta$$
Taking into account the optimal distribution law and the provisions described above, the nitrate nitrogen (N-NO₃⁻) concentration in water at the Ostrov observation point can exceed 1.53 mg/dm³ with a probability of 0.1% and 0.37 mg/dm³ with a probability of 99%. The extreme concentrations of rare and of most frequent occurrence were thus determined. For comparison, the greatest concentration in this sample over the observation period is 1.38 mg/dm³, and the sample average is 0.38 mg/dm³.
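The last steps of the calculation can be summarized in a short Python sketch: the outlier frequency, the product of the two probabilities, and the conversion of a normalized ordinate back to a concentration. All function names and numbers are hypothetical illustrations and do not reproduce the values in Tables 4 and 5.

```python
def outlier_frequency(n_outliers, n_observations):
    """P_beta: share of outliers among all observations of a series."""
    return n_outliers / n_observations

def total_exceedance_probability(p_alpha, p_beta):
    """P = P_alpha * P_beta, with P_alpha conditional on an outlier occurring."""
    return p_alpha * p_beta

def concentration_from_ordinate(m_j, sigma_j, t_p):
    """Convert a normalized ordinate back to a concentration: x = m + sigma * t."""
    return m_j + sigma_j * t_p

# Hypothetical inputs (assumptions, not values from the study)
p_beta = outlier_frequency(n_outliers=4, n_observations=400)
p_alpha = 0.001  # exceedance probability read from the outlier curve
p_total = total_exceedance_probability(p_alpha, p_beta)

# Hypothetical mean, standard deviation (mg/dm^3) and normalized ordinate
x_p = concentration_from_ordinate(m_j=0.4, sigma_j=0.2, t_p=2.5)

print(f"P_beta = {p_beta:.4f}, P = {p_total:.2e}, x_p = {x_p:.2f} mg/dm^3")
```

Treating the total probability as a simple product assumes that the magnitude of an outlier does not depend on how often outliers occur, which is consistent with the stationarity assumption stated earlier.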

Conclusion
The results of the study show that outliers are present in some data series. Here, outliers are pollutant concentrations that significantly exceed most of the observed values. They may be caused by emergency discharges from enterprises or by extreme weather. Including outliers in the calculation along with the main observational data leads to a significant overestimation of the numerical characteristics of river pollution over annual and multi-year averaging periods.
Therefore, when water pollution levels under normal conditions are assessed, outliers should be identified at the preliminary analysis stage and excluded from the calculations.
However, the probability and possible values of the outliers themselves also need to be assessed. The proposed method of outlier assessment combines the normalized outlier values from different series of observations in the river system into one series. The method rests on the assumption that the processes under consideration are stationary and ergodic. It therefore gives only approximate characteristics of outliers.
This method of estimating outliers makes it possible to adjust the calculation of nutrient transfer with river runoff significantly at the preliminary analysis stage.
The authors would like to thank the Northwest Department of Hydrometeorology and Environmental Monitoring for making the data used in this article available.