Research on the error probability distribution of photovoltaic output prediction based on output fluctuation characteristics and Generalized Gaussian Mixture Model

Forecast errors in photovoltaic power output are unavoidable, and analyzing their characteristics helps ensure the safe and stable operation of the power system. In this paper, the influence of predicted output fluctuation characteristics (predicted output amplitude and power variation) on prediction error was studied using the analysis of variance (ANOVA) method. The prediction error conditions were classified into six types based on clustering of the numerical characteristics of the predicted output. Then, a Generalized Gaussian Mixture Model (GGMM) was proposed to fit the prediction error distribution of each type of photovoltaic output. The mean absolute error (MAE), coefficient of determination ($R^2$), and root mean square error (RMSE) were used as accuracy evaluation indexes. The example analysis showed that the GGMM can capture the asymmetry and kurtosis diversity of the error distribution after division by conditions, and that its fitting results are better than those of the normal distribution, improved Laplace distribution, and t Location-Scale distribution models.


1 Introduction
Nowadays, the global energy crisis and environmental pollution are worsening. In response, accelerating the development of photovoltaic power generation, a representative clean renewable energy technology, has become an inevitable choice for countries around the world. However, when large-scale photovoltaic generation is connected to the power grid, its fluctuation and intermittency significantly increase the uncertainty of power system operation, which brings new challenges to power system scheduling. Photovoltaic output prediction is the main tool for mitigating this problem. As a significant part of photovoltaic output prediction, research on error distribution characteristics provides a statistical basis for uncertainty analysis of the prediction, which is of great significance to the safe and stable operation of the power system [1,2].
At present, there are only a few studies on the prediction errors of photovoltaic power generation. Among them, some describe the photovoltaic output prediction error under the assumption that the error obeys a normal distribution [3,4]. Reference [5] argued that the normal distribution assumption is reasonable because the expected value of the prediction error is zero and the prediction error distribution obeys the normal distribution. Reference [6] argued, through statistical analysis, that the normal distribution can describe the variation pattern of photovoltaic generation prediction errors, so that the error can be treated as a random variable when studying the optimal scheduling problem of a microgrid. These assumptions lack strong theoretical proof. It has been found that the t Location-Scale distribution model [7] describes the photovoltaic output prediction error more reasonably than the normal distribution. However, the t Location-Scale distribution model suffers from insufficient flexibility at the waist of the curve.
This paper clusters the prediction errors according to the corresponding predicted output amplitude and power variation, based on clustering of the numerical characteristics of the predicted output. A generalized Gaussian mixture model is then proposed to describe the distribution of photovoltaic power output prediction errors. The model can accurately describe error distributions with different kurtosis and shapes. The prediction error clustering analysis method proposed in this paper is not affected by the prediction algorithm or the geographic location of the photovoltaic plant, so its scope of application is broader.
2 Short-term photovoltaic power output prediction error model considering output fluctuation
2.1 Effect of predicted output fluctuation on prediction error
By using an MPPT controller, the photovoltaic cell array keeps its output power at the maximum power point. When external conditions such as light intensity and ambient temperature change, the operating voltage of the cell shifts accordingly. The starting point, step size, and direction of this shift may produce prediction errors, and the performance of the MPPT controller influences the magnitude of such errors. The correlation between such prediction errors and the influencing factors is investigated from a statistical point of view. The predicted output amplitude is chosen to represent the starting point of the shift and is referred to as the E factor. The difference between adjacent predicted outputs is chosen to represent the power variation, with its sign indicating the direction of the shift, and is referred to as the G factor. The E and G factors are arranged in ascending order and divided into 8 levels according to the principle of equal sample size. The values of the two factors at each level are shown in Table 1; the data in the table are normalized with the rated capacity as the reference value.
Using three years of records from a photovoltaic power system in a photovoltaic power plant, covering the predicted output amplitude, step size, and absolute prediction error whenever the short-term predicted output is nonzero, the effects of the E factor, the G factor, and their interaction (E*G) on the PV prediction error are investigated with the ANOVA method.
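The equal-sample-size level assignment for the E and G factors can be sketched as follows. This is a minimal illustration under stated assumptions, not the procedure used to build Table 1: the function name `equal_frequency_levels` and the synthetic `predicted` series are hypothetical, and the G factor is taken here as the signed difference between adjacent predicted outputs.

```python
import numpy as np

def equal_frequency_levels(values, n_levels=8):
    """Assign each sample a level 1..n_levels with (nearly) equal counts per level."""
    values = np.asarray(values, dtype=float)
    # Rank the samples in ascending order (stable sort keeps ties in input order).
    order = np.argsort(values, kind="stable")
    ranks = np.empty(len(values), dtype=int)
    ranks[order] = np.arange(len(values))
    # Map ranks 0..N-1 onto levels 1..n_levels.
    return ranks * n_levels // len(values) + 1

# Hypothetical normalized predicted output (rated capacity = 1), nonzero samples only.
rng = np.random.default_rng(0)
predicted = rng.uniform(0.01, 1.0, size=800)

e_levels = equal_frequency_levels(predicted)       # E factor: predicted amplitude
step = np.diff(predicted, prepend=predicted[0])    # signed adjacent difference
g_levels = equal_frequency_levels(step)            # G factor: power variation
```

Ranking before splitting keeps the level sizes equal even when the factor values are unevenly spread, which is the property the equal-sample-size principle asks for.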
The basic principle of ANOVA is to divide the difference between the means of different treatment groups into between-group differences and within-group differences. The between-group difference is expressed as the sum of the squared deviations of each group mean from the total mean, denoted as SSb; the between-group degrees of freedom are denoted as dfb, and the ratio of the two is the between-group mean square MSb. The within-group difference is expressed as the sum of the squared deviations of each value in a group from that group's mean, denoted as SSw; the within-group degrees of freedom are denoted as dfw, and the ratio of the two is the within-group mean square MSw. The ratio F = MSb/MSw follows an F distribution when the test hypothesis holds. The F value is compared with 1: when it is much greater than 1, the difference between the group means is statistically significant, and when it is close to 1, the difference is not statistically significant. The p-value, that is, the probability of the F statistic exceeding the observed value when the test hypothesis holds, is obtained by consulting the F critical value table. The test level is set to 0.05. The null hypothesis is rejected when p<0.05, indicating a significant difference, that is, the test factor has an effect on the study subject; otherwise, the null hypothesis is accepted and the test factor has no significant effect on the study subject. The test parameter values corresponding to each factor can be obtained by combining the factor and level relationships in Table 1, as shown in Table 2.
The first column "Source" in Table 2 is the source of variance, the second column "Sum Sq" is the sum of squares corresponding to each source of variance, the third column "df" is the corresponding degrees of freedom, the fourth column "Mean Sq" is the corresponding mean square, the fifth column "F" is the observed value of the F-test statistic, and the sixth column "P" is the test p-value obtained from the distribution of the F-test statistic. It can be seen that the p-values are all less than the significance level of 0.05, which shows that the E factor, the G factor, and the E*G interaction all have a significant effect on the magnitude of the prediction error.
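The between-group and within-group decomposition described above can be sketched for a single factor. This is a simplified one-way illustration on synthetic data, whereas the analysis in this section uses two factors and their interaction; the function name `one_way_anova` and the sample groups are hypothetical.

```python
import numpy as np
from scipy import stats

def one_way_anova(groups):
    """Return (F, p) from the SSb / SSw decomposition for one factor."""
    groups = [np.asarray(g, dtype=float) for g in groups]
    all_values = np.concatenate(groups)
    grand_mean = all_values.mean()
    # Between-group sum of squares: group means against the total mean,
    # weighted by group size.
    ss_b = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
    # Within-group sum of squares: each value against its own group mean.
    ss_w = sum(((g - g.mean()) ** 2).sum() for g in groups)
    df_b = len(groups) - 1
    df_w = len(all_values) - len(groups)
    f_value = (ss_b / df_b) / (ss_w / df_w)
    # p-value: upper-tail probability of the F distribution.
    p_value = stats.f.sf(f_value, df_b, df_w)
    return f_value, p_value

# Synthetic absolute errors for three hypothetical factor levels.
rng = np.random.default_rng(1)
groups = [rng.normal(loc=m, scale=1.0, size=50) for m in (0.0, 1.0, 2.0)]
f_value, p_value = one_way_anova(groups)
significant = p_value < 0.05   # reject the null hypothesis at the 0.05 level
```

Using the survival function of the F distribution replaces the table lookup; `scipy.stats.f_oneway` gives the same statistic and can serve as a cross-check.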

2.2.1 GGMM.
The short-term photovoltaic output prediction error distribution exhibits asymmetry and kurtosis diversity. Therefore, prediction error models are required to have flexible shapes and peaks. The traditional Gaussian Mixture Model (GMM) is a weighted sum of multiple Gaussian functions, and its probability density function is defined as in equation (1). The number of summation terms n is usually taken as 3-5.
where $a_k$ is the weighting coefficient, $a_k \ge 0$, and each component is a Gaussian distribution density function, as in equation (2).
The cumulative distribution function is given in equation (3). The range of the GMM is $(-\infty, +\infty)$, and the integral of each Gaussian component over this range is 1. To ensure that the integral of the overall probability density of the GMM is also 1, each Gaussian component must be assigned a weight no greater than 1, and the weights must sum to 1. However, the probability distribution of the photovoltaic output prediction error obtained by the clustering method proposed in this paper does not span the range $(-\infty, +\infty)$, and the prediction error model is required to integrate to 1 over the given value range. Therefore, the GGMM is proposed in this paper. The definition of the GGMM is basically the same as that of the GMM, as in equation (1) and equation (2), except that the sum of the weights is no longer required to be 1, that is, $\sum_{k=1}^{n} a_k \neq 1$ is permitted. The GGMM is therefore more flexible than the GMM and more suitable for describing the distribution of photovoltaic output prediction errors.
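For reference, the mixture described in this subsection can be written in the standard textbook form below. The symbols $\mu_k$, $\sigma_k$, $N$, $\Phi$, $x_{\min}$ and $x_{\max}$ are assumptions introduced here for illustration and are not necessarily the notation of equations (1)-(3); the last line is one reading of the requirement that the model integrate to 1 over the given value range.

```latex
% Mixture density: weighted sum of n Gaussian components
f(x) = \sum_{k=1}^{n} a_k \, N(x;\mu_k,\sigma_k), \qquad a_k \ge 0

% Gaussian component density
N(x;\mu_k,\sigma_k) = \frac{1}{\sqrt{2\pi}\,\sigma_k}
    \exp\!\left( -\frac{(x-\mu_k)^2}{2\sigma_k^2} \right)

% Cumulative distribution function (\Phi is the standard normal CDF)
F(x) = \sum_{k=1}^{n} a_k \, \Phi\!\left( \frac{x-\mu_k}{\sigma_k} \right)

% GMM: weights sum to 1, so f integrates to 1 over (-\infty, +\infty)
\sum_{k=1}^{n} a_k = 1

% GGMM: the weight sum is free; normalization holds on [x_{\min}, x_{\max}]
\sum_{k=1}^{n} a_k \left[ \Phi\!\left( \frac{x_{\max}-\mu_k}{\sigma_k} \right)
    - \Phi\!\left( \frac{x_{\min}-\mu_k}{\sigma_k} \right) \right] = 1
```

Because each bracketed term is less than 1 on a finite interval, the weights of a model normalized in this way sum to more than 1, which is consistent with dropping the unit-sum constraint.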

2.2.2 Model parameter estimation and accuracy evaluation index.
In this paper, the least squares method is used to estimate the model parameters, and the parameter estimates are obtained with the nonlinear curve fitting function lsqcurvefit in MATLAB.
Three evaluation indexes, MAE, $R^2$, and RMSE, were selected to evaluate the fitting effect. The MAE characterizes the difference between the fitted function and the actual distribution function. The closer the MAE is to 0, the better the model fit. The $R^2$ value characterizes how closely the fitted curve follows the empirical data. The closer $R^2$ is to 1, the higher the reference value of the fitted equation, while the closer it is to 0, the lower the reference value. The RMSE is very sensitive to especially large or especially small errors in a set of fits, so it reflects the precision of the fit well. The closer the RMSE is to 0, the higher the precision of the model fit. The specific formulas are given in equations (4)-(6).
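A least-squares fit of this kind can be sketched in Python. Here `scipy.optimize.curve_fit` is used as a stand-in for MATLAB's lsqcurvefit, so this is an illustration rather than the paper's implementation. The function `ggmm_pdf`, the two-component synthetic target, the initial guess, and the bounds are all assumptions; `x` and `y` stand for histogram bin centers and empirical density values.

```python
import numpy as np
from scipy.optimize import curve_fit

def ggmm_pdf(x, *params):
    """Sum of Gaussian components; params = (a1, mu1, sigma1, a2, mu2, sigma2, ...)."""
    x = np.asarray(x, dtype=float)
    total = np.zeros_like(x)
    for a, mu, sigma in zip(params[0::3], params[1::3], params[2::3]):
        total += a * np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (np.sqrt(2 * np.pi) * sigma)
    return total

# Synthetic "empirical density": a narrow peak plus a broad, shifted shoulder.
# The weights (0.7 + 0.4) deliberately do not sum to 1.
true_params = (0.7, 0.0, 0.02, 0.4, 0.03, 0.08)
x = np.linspace(-0.3, 0.3, 121)          # stand-in for histogram bin centers
y = ggmm_pdf(x, *true_params)            # stand-in for empirical density values

# Initial guess and bounds: weights >= 0, sigmas strictly positive, no sum constraint.
p0 = (0.5, 0.0, 0.03, 0.5, 0.0, 0.1)
lower = [0.0, -1.0, 1e-4] * 2
upper = [np.inf, 1.0, 1.0] * 2
popt, _ = curve_fit(ggmm_pdf, x, y, p0=p0, bounds=(lower, upper))
fitted = ggmm_pdf(x, *popt)
```

With real histogram data the result depends on the initial guess, so several starting points may be worth trying. The number of components follows from the length of `p0`.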
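The three indexes can be computed as in the sketch below, which assumes the standard definitions of MAE, $R^2$, and RMSE applied to the empirical and fitted density values. The function name `fit_metrics` and the sample arrays are hypothetical.

```python
import numpy as np

def fit_metrics(empirical, fitted):
    """Return (MAE, R2, RMSE) between empirical and fitted density values."""
    empirical = np.asarray(empirical, dtype=float)
    fitted = np.asarray(fitted, dtype=float)
    residual = empirical - fitted
    mae = np.mean(np.abs(residual))
    rmse = np.sqrt(np.mean(residual ** 2))
    # R2 = 1 - (residual sum of squares) / (total sum of squares)
    r2 = 1.0 - np.sum(residual ** 2) / np.sum((empirical - empirical.mean()) ** 2)
    return mae, r2, rmse

mae, r2, rmse = fit_metrics([1.0, 2.0, 3.0, 4.0], [1.1, 1.9, 3.2, 3.8])
# mae = 0.15, r2 = 0.98, rmse = sqrt(0.025), approximately 0.158
```

Squaring the residuals before averaging is what makes the RMSE more sensitive than the MAE to a few large deviations, for example at the peak of the distribution.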

3 Example analysis
In this paper, the short-term predicted output and measured data of a photovoltaic power plant for three years are selected as training and testing samples to test the accuracy of the proposed model. The predicted and measured output data are standardized.
To illustrate the accuracy of the GGMM in describing the error distribution of the short-term predicted photovoltaic output, the normal distribution, t Location-Scale distribution, improved Laplace distribution, and three GGMMs are used to fit error samples of the same type, and the accuracy of each model is analyzed by comparing the fitting effects. The comparison distributions are given in equations (7)-(9).
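As a reminder of two of the comparison models, the normal and t Location-Scale densities can be written in their standard textbook forms. The symbols $\mu$, $\sigma$, and $\nu$ (location, scale, and shape) are assumptions and may differ from the notation of equations (7)-(9); the improved Laplace distribution is omitted because its definition cannot be recovered from the text.

```latex
% Normal distribution
f_{\mathrm{N}}(x) = \frac{1}{\sqrt{2\pi}\,\sigma}
    \exp\!\left( -\frac{(x-\mu)^2}{2\sigma^2} \right)

% t Location-Scale distribution (location \mu, scale \sigma > 0, shape \nu > 0)
f_{t}(x) = \frac{\Gamma\!\left(\frac{\nu+1}{2}\right)}
                {\sigma \sqrt{\nu\pi}\; \Gamma\!\left(\frac{\nu}{2}\right)}
    \left[ 1 + \frac{1}{\nu} \left( \frac{x-\mu}{\sigma} \right)^{2} \right]^{-\frac{\nu+1}{2}}
```

Both densities are symmetric about $\mu$, so neither can represent a skewed error distribution on its own.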
(1) t Location-Scale distribution
The fitting results are shown in Figure 2 and Table 5. From these results, it can be seen that, for the distribution case with a prominent peak, the accuracy of the normal distribution model is the lowest: it is constrained by its fixed kurtosis and cannot reproduce the tall, narrow shape of the photovoltaic output prediction error distribution. The accuracy of the improved Laplace distribution is also relatively low, because the peak coefficient of this function is tied to its shape coefficient, which limits the shape of the function. It cannot provide a tall, narrow peak and a gentle waist at the same time, and it does not reach the required peak. The accuracy of the t Location-Scale distribution is better than that of the first two, and its fitted curve at the peak basically overlaps with the GGMM curve. However, its shape is too tall and thin, and the curve at the waist is lower than the empirical distribution. The GGMM fitting curve captures both the peak and the flexible waist, and it fits the empirical distribution best.
In the case where the kurtosis is slightly higher than that of the normal distribution, the normal distribution follows a trend similar to the empirical distribution away from the peak. It has a gentle waist curve but underestimates the peak value. The waist of the fitted improved Laplace curve is lower or higher than the empirical distribution when its peak reaches the required value. The overall fitting effect of the t Location-Scale distribution model is the closest to that of the GGMM. The accuracy of the GGMM is slightly lower than in the case with a prominent peak, but its evaluation indexes are still better than those of the other three models, so it remains the best choice.
In summary, the normal distribution model cannot express the asymmetry of the short-term photovoltaic power output prediction error, so it is not suitable for describing the photovoltaic power prediction error. The improved Laplace distribution model cannot describe distributions with a prominent peak. The t Location-Scale distribution model suffers from insufficient flexibility at the waist. The GGMM has a more flexible shape than the other functions, and its evaluation indexes are the best, so it describes the prediction error of photovoltaic power generation with higher accuracy.

4 Conclusion
This paper establishes a photovoltaic power output prediction error model based on the fluctuation of the predicted output, and proposes using the GGMM to describe the photovoltaic power prediction error. The GGMM is compared with the t Location-Scale distribution, improved Laplace distribution, and normal distribution models on a numerical example of output prediction for a photovoltaic power plant to verify its superiority. The main conclusions are as follows.
(1) The photovoltaic power output prediction error at each moment is correlated with the predicted output amplitude and power variation at that moment. The errors can be divided into six types of error conditions according to the principle of equally dividing sample size, and the error distribution shows asymmetry and kurtosis diversity.
(2) The GGMM proposed in this paper is flexible in shape and can capture both the asymmetry and the kurtosis diversity of the photovoltaic prediction error distribution, with higher accuracy and applicability than the normal distribution, improved Laplace distribution, and t Location-Scale distribution models.