Weather data errors analysis in solar power stations generation forecasting

The paper presents a short-term forecasting model for solar power station (SPS) generation developed by the authors. The model is based on weather data and is built into an existing software product as a separate short-term forecasting module for SPS generation. The authors' research revealed that the main problems in forecasting SPS generation on cloudy days stem not from the error of the developed model but from the use of the same learning sample for both sunny and cloudy days. The paper analyzes the main problems related to the formation, structure, quality and representativeness of the learning sample for forecasting SPS generation on cloudy days. It also includes a calculation example for an existing SPS and a detailed analysis of the generation forecast on cloudy days based on actual weather provider data.


Introduction
Today, the main problem in short-term (day-ahead) forecasting of solar power station (SPS) generation is the need for an accurate and detailed weather forecast, which is not cheap for the customer (the SPS owner). To save money, the customer often uses freely available weather data, which, according to the authors' research, is ultimately insufficient for accurately identifying the weather conditions needed to solve this problem.
It should be noted that the composition of the weather data is statistically significant in terms of its influence on the transparency factor calculation and makes it possible to establish a reliable statistical relationship between the features (cloud cover, the altitude of the solar disk) and the response (the solar radiation flux density on a horizontal surface) [1]. The impossibility of identifying the weather conditions leads to a high variance of the response $y$ (transparency factor) with respect to the features under consideration $x_i$ (the estimate of cloud cover, the sine of the solar disk altitude angle).
A rough estimate shows that, on average, every fifth change in the actual value of the transparency factor does not correspond to a change in the actual cloud cover. If, in addition, the forecast cloud cover is taken into account, which is a priori less accurate than the actual one, the error of the initial data increases, which introduces a significant error into the calculation results. To increase the accuracy of forecasting on days with precipitation, it is necessary to accumulate appropriate retrospective data and form a separate sample for days with precipitation. The type of precipitation must be clearly distinguished: drizzle, rain, rain shower, hail, snow, sleet, etc.

Short-term SPS forecasting model
The presented methodology is based on calculating the solar radiation flux density (external illumination) on a horizontal surface over a six-day interval from weather provider data at half-hour resolution, which is needed to build the regression model. The methodology is described in detail in [2]. The main stages of its implementation are presented below.

Forecast calculation of the solar radiation flux density incident on a horizontal surface
The average value of the solar radiation flux density near the ground, incident on the horizontal surface, $G$, for each hour of the forecast day is determined by the expression:
$$G = k_T \, G_0,$$
where $G_0$ is the average value of the solar radiation flux density at the boundary of the atmosphere and $k_T$ is the transparency factor.
To define the parameter $G_0$, a physical model is used that determines the solar power flux density at the boundary of the atmosphere; it requires only the location of the power station, the number of the day in question and the hour of the day in question. The algorithm for calculating the solar power flux density at the boundary of the atmosphere is described by the formulas for the average value of $G_0$ given in [3].
To define the parameter $k_T$, a statistical model is used that determines the transparency factor for the next day. It requires retrospective cloud data from previous days, retrospective data on the solar radiation flux density from previous days, forecast cloud data for the next day, the location of the power station, the number of the day in question and the hour of the day.
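As an illustration of the physical part of the model, the sketch below computes the solar flux density on a horizontal plane at the boundary of the atmosphere from the latitude, the day number and the hour, and then scales it by a transparency factor. It relies on common textbook approximations (a solar constant of 1367 W/m², Cooper's declination formula, true solar time), which are assumptions here and not necessarily the exact formulas of [3]. It also evaluates a single instant instead of an hourly average, and all function names are hypothetical.

```python
import math

G_SC = 1367.0  # solar constant in W/m^2 (assumed value, not taken from the paper)


def declination_deg(day_of_year):
    """Solar declination in degrees (Cooper's approximation)."""
    return 23.45 * math.sin(math.radians(360.0 * (284 + day_of_year) / 365.0))


def sin_solar_altitude(lat_deg, day_of_year, solar_hour):
    """Sine of the solar altitude angle; solar_hour is true solar time, 12 = noon."""
    phi = math.radians(lat_deg)
    delta = math.radians(declination_deg(day_of_year))
    omega = math.radians(15.0 * (solar_hour - 12.0))  # hour angle
    return (math.sin(phi) * math.sin(delta)
            + math.cos(phi) * math.cos(delta) * math.cos(omega))


def extraterrestrial_horizontal(lat_deg, day_of_year, solar_hour):
    """G_0 in W/m^2 on a horizontal plane; zero while the sun is below the horizon."""
    # Correction for the varying Earth-Sun distance over the year.
    g_on = G_SC * (1.0 + 0.033 * math.cos(math.radians(360.0 * day_of_year / 365.0)))
    return max(0.0, g_on * sin_solar_altitude(lat_deg, day_of_year, solar_hour))


def forecast_flux(k_t, g0):
    """Ground-level flux density as the product of the transparency factor and G_0."""
    return k_t * g0
```

With these assumptions, `forecast_flux(k_t, extraterrestrial_horizontal(lat, n, hour))` gives the ground-level estimate for one instant; averaging over an hour would require integrating over the hour angle.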
The regression model expresses the transparency factor $k_T$ as a function of the cloud cover estimate $cc$ and the sine of the solar disk altitude angle. To minimize the manual labor of the SPS operational staff in all subsequent calculations, it was decided not to use the cloud cover estimate $cc$ in its original form, since this requires a complex analysis of retrospective data of considerable depth (up to two years) on the total cloud cover $TCC$ and the solar radiation flux density incident on the horizontal surface of the earth $G$ in order to determine the effect of the nature of the cloud cover on the magnitude of $G$.
In further calculations, the value $cc$ is set equal to the total cloud cover $TCC$ in relative units [4]:
$$cc = \frac{TCC}{100},$$
where $cc$ is the cloud cover, [p.u.]; $TCC$ is the total cloud cover, [%].
The transparency factor $k_T$ for the previous days is determined from the known values of the solar radiation flux density for the previous days $G$, obtained from the data of the local weather station, and the calculated values $G_0$ according to the formula:
$$k_T = \frac{G}{G_0}.$$
Regression analysis is performed for certain intervals of the solar altitude angle. For each interval, the coefficients of the regression model $a_1$, $b_1$, $c_1$, $d_1$ are calculated by the expressions given in [5][6][7], in which $B$ is the column vector of regression coefficients.
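The statistical step can be sketched as an ordinary least-squares fit of the transparency factor within one angle interval. The functional form of the regression is not reproduced in this text, so the feature set below (intercept, $cc$, sine of the solar altitude, and their product) is purely a placeholder with four coefficients, and the use of the normal equations $(X^T X) B = X^T Y$ is an assumption about how the coefficient vector $B$ is obtained. All function names are hypothetical.

```python
def transparency_factor(g, g0):
    """k_T = G / G_0; None where G_0 is not positive (sun below the horizon)."""
    return g / g0 if g0 > 0 else None


def features(cc, sin_alt):
    """Placeholder feature row: intercept, cc, sin(altitude) and their product."""
    return [1.0, cc, sin_alt, cc * sin_alt]


def solve_linear(a, b):
    """Solve a * x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("singular system: the sample lacks feature variety")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):  # back substitution
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def fit_coefficients(rows, k_t):
    """Least squares via the normal equations (X^T X) B = X^T Y."""
    n = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(n)] for i in range(n)]
    xty = [sum(r[i] * y for r, y in zip(rows, k_t)) for i in range(n)]
    return solve_linear(xtx, xty)


def predict(coeffs, cc, sin_alt):
    """Transparency factor predicted by the fitted coefficients."""
    return sum(b * f for b, f in zip(coeffs, features(cc, sin_alt)))
```

A separate call to `fit_coefficients` would be made for each interval of the solar altitude angle, using only the retrospective pairs that fall into that interval.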

Accuracy analysis of the SPS generation forecasting
The accuracy of the forecast is analyzed by determining the forecast error. In this study, the hourly WAPE and the WMAPE, both in %, are evaluated [8,9]:
- Weighted absolute percentage error (WAPE), where APE is the absolute percentage error.
- Weighted mean absolute percentage error (WMAPE).
Estimating the error relative to the solar radiation flux density at the boundary of the atmosphere makes it possible to relate the error to a conditionally constant "base" and to show what percentage of the maximum possible solar radiation flux density for a given time interval the error amounts to [10].
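For orientation, the error measures can be sketched as follows. `ape` and `mape` follow the usual definitions of the absolute and mean absolute percentage error. The two weighted variants are only one plausible reading of the "base" described above, with $G_0$ as the denominator; they are assumptions, not the formulas of [8,9], and all names are hypothetical.

```python
def ape(actual, forecast):
    """Absolute percentage error for one hour, in percent of the actual value."""
    return abs(actual - forecast) / abs(actual) * 100.0


def mape(actuals, forecasts):
    """Mean of the hourly APE values over hours with a non-zero actual value."""
    errors = [ape(a, f) for a, f in zip(actuals, forecasts) if a != 0]
    return sum(errors) / len(errors)


def base_weighted_ape(actual, forecast, g0):
    """Assumed weighted form: hourly error as a percentage of G_0 for that hour."""
    return abs(actual - forecast) / g0 * 100.0


def base_weighted_mape(actuals, forecasts, g0s):
    """Assumed daily form: summed absolute error over summed G_0, in percent."""
    total_error = sum(abs(a - f) for a, f in zip(actuals, forecasts))
    return total_error / sum(g0s) * 100.0
```

Dividing by $G_0$ instead of the actual value keeps the measure bounded on overcast hours, when the actual flux density is close to zero and the plain APE grows without limit.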

Calculation example of short-term forecasting of SPS generation
The calculation example presents short-term forecasts for SPS-1 over the period 04.10.2017-21.10.2017 using different learning samples. In accordance with the proposed clustering methodology, a new sample reflecting the weather conditions is formed for each day under consideration in order to improve the accuracy of short-term forecasting.
The data supplied by the weather provider include: a cloud cover estimate, %; air temperature, °C; relative humidity, %; wind speed, m/s. Cloud cover is the key parameter affecting the transparency factor. It is therefore most expedient to cluster the initial array of retrospective data by gradations of cloud cover. The following rules for forming blocks of retrospective data were used in the calculations:
- The retrospective includes the values of the features $x_i$ and the response $y$ for cloudy days whose average cloud cover during daylight hours differs by no more than 30% from the average cloud cover during daylight hours on the current forecast day.
- The retrospective includes the values of the features $x_i$ and the response $y$ for cloudy days whose average cloud cover during daylight hours differs by no more than 50% from the average cloud cover during daylight hours on the current forecast day.
- The retrospective includes the values of the features $x_i$ and the response $y$ that fall within the range between the minimum and the maximum actual cloud cover during daylight hours on the current forecast day.
- The retrospective includes the values of the features $x_i$ and the response $y$ over the entire available range.
An example of the calculation results for the two most problematic (cloudy) days, October 17-18, 2017, is presented in Table 1.
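The selection rules above can be sketched as simple filters over the retrospective data. The sketch assumes that "differs by no more than 30%" means an absolute difference in percentage points of cloud cover, which the text does not state explicitly. The data layout (a mapping from date to mean daylight cloud cover, and a list of records with a `cc` field) and the function names are hypothetical.

```python
def select_days_by_mean_cover(history, forecast_mean_cc, tolerance):
    """Keep the days whose mean daylight cloud cover (in %) lies within
    `tolerance` percentage points of the forecast day's mean cloud cover."""
    return [day for day, mean_cc in history.items()
            if abs(mean_cc - forecast_mean_cc) <= tolerance]


def select_records_by_cover_range(records, cc_min, cc_max):
    """Keep the records whose cloud cover lies between the minimum and maximum
    actual cloud cover observed during daylight on the forecast day."""
    return [r for r in records if cc_min <= r["cc"] <= cc_max]


# Illustrative values only, not data from the paper.
history = {"2017-10-10": 20, "2017-10-11": 65, "2017-10-12": 90, "2017-10-13": 100}
narrow = select_days_by_mean_cover(history, forecast_mean_cc=60, tolerance=30)
wide = select_days_by_mean_cover(history, forecast_mean_cc=60, tolerance=50)
```

Here `narrow` keeps the days with 65% and 90% cloud cover, while `wide` keeps all four days; the fourth rule corresponds to applying no filter at all.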

Retrospective analysis
In this research, a 6-day retrospective was initially used to forecast SPS generation in the software implementation. With a 6-day retrospective in the presented algorithm, the resulting error is random, since the learning sample used to calculate the regression coefficients "slides" in 1-day steps. As a result, a forecast for a cloudy day may be based on clear days and vice versa, which degrades the accuracy of short-term forecasting and leads to unreliable results.
In previous research, the recommendation to use the 6-day retrospective was due to the following factors:
- The amount of initial data on the SPS originally provided by the customer, which covered a period of 2 weeks and contained full information about weather conditions and the solar power flux density measurements at the SPS.
- The customer's requirement to minimize the amount of retrospective data needed, so that the forecasting system can be implemented rapidly at new photovoltaic stations.

Regression function coefficients
The coefficients of the regression function are chosen so as to minimize the approximation error over the available retrospective data. If some combination of the features $x_i$ and the response $y$ is absent from the retrospective data, the model may, for example, yield a negative calculated transparency factor when the cloud cover increases to 0.9-1.0, which indicates that the learning sample needs to be expanded and/or refined. Therefore, to obtain a stable forecast, the learning sample should contain about 80% cloudy days and about 20% clear days. Even when forecasting the solar power flux density for cloudy days, clear days must be present in the retrospective data to obtain stable coefficients of the regression function.

Learning sample expansion
Expanding the learning sample improves the accuracy of short-term forecasting. Averaging the best results obtained over the period 04.10.2017-21.10.2017 gives a mean absolute percentage error of about 28%. The mean absolute percentage error of the solar power flux density forecast was 49%. Using the entire available retrospective gives an average error of 38% over the period under review. Within this period, a number of "problem" days can be identified on which the mean absolute percentage error reaches 100% or more. This problem is due to atypically low transparency factors for the given cloud conditions. Examples are 17.10.2017 and 18.10.2017.
The retrospective database provided by the customer does not contain information that uniquely identifies weather conditions; therefore, additional open sources on the Internet were analyzed to identify cause-and-effect relationships. For this purpose, the databases of actual weather conditions of weather providers A and B were considered.
It should be noted that open databases of actual weather information are not available for arbitrary geographic coordinates; they are tied to the locations of weather stations. The meteorological station closest to SPS-1 is the Airport-1 weather station, located 45 km from SPS-1.
The remoteness of the meteorological station from SPS-1 does not allow its weather data archive to be used for forecasting, but it does make it possible to assess the meteorological situation. Weather providers A and B both provide the same information on the Airport-1 weather station, but B presents a more detailed set of meteorological parameters.
The forecast day of 18.10.2017 is characterized by broken clouds with a low cloud height of 500 m. The forecast day of 17.10.2017 is likewise characterized by broken clouds with a low cloud height of 500 m; the cloud type is cumulonimbus, and precipitation takes the form of rain and rain showers (Table 2).
The days presented are characterized by dense cloud cover and precipitation. As a rule, SPS are built in areas with the greatest number of sunny days per year; days with precipitation are therefore atypical, and the forecast for these days significantly overestimates the transparency factor and, as a consequence, the solar power flux density.
The accuracy of the forecast was increased by constructing a separate sample of days with precipitation from the data in the Airport-1 archive on the portal of weather provider B. As a result, the forecast error was reduced:
- for 17.10.2017: from 135% to 100%;
- for 18.10.2017: from 196% to 75%.
To increase the accuracy of forecasting on days with precipitation, it is necessary to accumulate an appropriate retrospective of the data and to construct a separate sample for days with precipitation. It is necessary to clearly distinguish the type of precipitation: drizzle, rain, rain shower, hail, snow, sleet, etc.

Revealing the main errors in the learning sample
The greatest contribution to the MAPE for the day under consideration comes from the hours when the actual cloud cover data do not correspond to the measured value of the transparency factor, for example when an increase in cloud cover (in percent) is accompanied by an increase in the transparency factor, or vice versa. The data archive supplied by the customer was checked for discrepancies of the following kinds:
- The changes in the transparency factor and in the cloud cover (in r.u.) have the same sign; for example, a decrease in cloud cover corresponds to a decrease in the transparency factor.
- A change in one value by more than 0.1 r.u. is not accompanied by any change in the other value.
For the sample with solar altitude angles of 0-20 degrees above the horizon, there were 137 discrepancies out of a total of 503 values (27.2%). For the sample with solar altitude angles of 20-40 degrees above the horizon, there were 56 discrepancies out of a total of 295 values (18.9%). A rough estimate shows that, on average, every fifth change in the actual value of the transparency factor does not correspond to a change in the actual cloud cover. Moreover, when the forecast cloud cover is taken into account, which is a priori less accurate than the actual one, the error of the initial data increases, which introduces a significant error into the calculation results.
The impossibility of identifying weather conditions leads to a high variance of the response $y$ (transparency factor) with respect to the features under consideration $x_i$ (the estimate of cloud cover, the sine of the solar disk altitude angle). For example, Figures 1-3 show the transparency factor function for solar disk altitude angles of 0-20 degrees above the horizon against the point cloud of the retrospective data (the regression function coefficients were obtained from all the available retrospective data on SPS-1 for angles of 0-20 degrees: $a_1 = 0.3273$, $b_1 = -1.1467$, $c_1 = -0.1404$, $d_1 = 1.0551$).
It should be noted that the way the customer converts the initial data supplied by the weather provider from linguistic to numeric form also reduces the informative content of the learning sample.
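The archive check can be sketched as a count of consecutive steps at which cloud cover and transparency factor behave inconsistently. The sketch assumes that changes are taken between successive samples of the same series and that "no change" means a difference within a small numerical tolerance; both are assumptions, and the names and data are hypothetical.

```python
def is_discrepancy(d_cc, d_kt, threshold=0.1, flat_eps=1e-9):
    """Flag a step where cloud cover and transparency factor move together,
    or where one moves by more than `threshold` r.u. while the other is flat."""
    cc_flat = abs(d_cc) <= flat_eps
    kt_flat = abs(d_kt) <= flat_eps
    same_sign = not cc_flat and not kt_flat and (d_cc > 0) == (d_kt > 0)
    unanswered = ((abs(d_cc) > threshold and kt_flat)
                  or (abs(d_kt) > threshold and cc_flat))
    return same_sign or unanswered


def count_discrepancies(cc, kt, threshold=0.1):
    """Return (number of flagged steps, total number of steps)."""
    steps = list(zip(zip(cc, cc[1:]), zip(kt, kt[1:])))
    flagged = sum(is_discrepancy(c1 - c0, k1 - k0, threshold)
                  for (c0, c1), (k0, k1) in steps)
    return flagged, len(steps)


# Illustrative series in relative units, not data from the paper.
cc = [0.2, 0.5, 0.5, 0.9, 0.3]
kt = [0.6, 0.7, 0.4, 0.4, 0.8]
flagged, total = count_discrepancies(cc, kt)
```

In this toy series three of the four steps are flagged; only the last step, where cloud cover falls and the transparency factor rises, is consistent.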

Conclusion
To increase the accuracy of short-term forecasting of SPS generation on cloudy days, additional information on meteorological events and conditions must be accumulated. This will make it possible to enlarge the retrospective and to divide the sampled population into separate samples that better identify such events as "heavy/light rain", "rain shower", "drizzle", "heavy/light snow", "hail", "fog", "mist", and also to classify cloud cover. All parameters of the actual and forecast information supplied by the weather provider (including the original linguistic data) should be stored daily so that the quality of the SPS generation forecast can subsequently be improved by reducing the variance of the measured cloud cover values and the calculated transparency factor.

The recommended depth of the retrospective, which allows the climatic features of the SPS location to be taken into account and the sampled population to be clustered into separate samples, is at least 1 year. If no additional characteristics of weather conditions and events are available, the following is recommended for short-term forecasting:
- To form a retrospective for clear days in the range of an average daily cloud cover of 0.0-0.5 r.u.
- To use the entire available retrospective for cloudy days in the following ratio: 20% clear days, 80% cloudy days.
For new facilities (or those returned to service), it is recommended to form the learning sample according to the rules described above, while ensuring that retrospective data (mostly cloudy days) are added to the sample monthly during the current year. The following ratio should be maintained in the complete sample: no more than 20% clear days and no less than 80% cloudy days, with the total number of input-output value pairs in the sample being at least 800. New retrospective data are added to the complete sample in line with the recommendations above, including the data in their original form, i.e. the linguistic data supplied by the weather provider.
It is recommended to refine the calculation and display of the forecast error for short-term SPS generation forecasting: use the APE to estimate the hourly error and the MAPE to estimate the daily error.
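The composition rules above can be checked mechanically before a sample is used for fitting. The sketch below assumes that a day counts as clear when its average cloud cover does not exceed 0.5 r.u. (the upper bound of the clear-day range quoted above) and that the 20%/80% ratio is counted in days; both are interpretations, and the names and data layout are hypothetical.

```python
CLEAR_MAX_CC = 0.5  # assumed boundary between clear and cloudy days, in r.u.


def check_sample(days, max_clear_share=0.20, min_pairs=800):
    """`days` is a list of (mean_cloud_cover, number_of_value_pairs) tuples.
    Returns the share of clear days, the total pair count and an overall flag."""
    total_days = len(days)
    clear_days = sum(1 for mean_cc, _ in days if mean_cc <= CLEAR_MAX_CC)
    total_pairs = sum(n_pairs for _, n_pairs in days)
    clear_share = clear_days / total_days
    return {
        "clear_share": clear_share,
        "total_pairs": total_pairs,
        "ok": clear_share <= max_clear_share and total_pairs >= min_pairs,
    }


# Illustrative sample: 8 clear and 32 cloudy days with 24 value pairs each.
report = check_sample([(0.2, 24)] * 8 + [(0.8, 24)] * 32)
```

For this illustrative sample the share of clear days is exactly 20% and the sample holds 960 pairs, so both conditions are met.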