Study on Ultra-short-time Power Forecast of Photovoltaic System based on Ground-based Cloud Image Recognition and Key Impact Factors

In recent years, under the dual pressure of resource shortage and environmental pollution, the photovoltaic (PV) power generation industry has flourished. Irradiance forecasting for PV power plants is of great significance for output prediction, grid dispatching and safe operation, and cloud cover is the key factor that makes the irradiance fluctuate. In this article, color ground-based cloud images collected every minute by an all-sky imager are taken as the research object. Based on the traditional threshold method, a hybrid entropy threshold method is proposed to identify cloud clusters. Using correlation analysis, five highly correlated impact factors are selected as input parameters of a BP network optimized by a genetic algorithm (GA-BP). Verification and comparison show that the hybrid entropy threshold method has higher recognition accuracy than the traditional threshold methods, with an average relative error of about 5%. On this basis, the irradiance prediction of GA-BP also achieves better results than the other models and can meet the application requirements of PV power plants.


Introduction
As industrialization continues to spread throughout the world, industrial production consumes a large amount of resources. Traditional energy sources such as coal, oil and natural gas have revealed many drawbacks in the course of large-scale use. Therefore, solar power, a clean energy source with no pollution, low cost and high energy, has received unprecedented attention, and photovoltaic (PV) power generation is developing rapidly [1] . However, the output of a PV system is strongly affected by many environmental factors. Studies have shown that when the installed PV generating capacity accounts for more than 15% of the grid, large fluctuations in the PV system can lead to the paralysis of the entire grid [2] . Given this uncertainty and risk, accurate forecasts of solar irradiance and PV output are particularly important.
At present, the existing forecasting methods mainly include direct forecasting and indirect forecasting. The direct method uses statistical methods to predict PV power generation. For example, it can use multiple regression analysis to establish the prediction equation [3] , or use Levenberg-Marquardt (L-M) optimization to build a Multilayer Perceptron (MLP) that approximates the voltage-current curve [4] . One disadvantage of this method is that the quality and quantity of the historical data directly influence the prediction results. The indirect method first predicts the solar irradiance, then combines the predicted irradiance with various types of meteorological data as the input parameters of a photoelectric conversion efficiency model, and uses this model to complete the forecast. Tested examples include irradiance attenuation models based on the autoregressive moving average model (ARMA) [5] or on the support vector machine (SVM) [6] .
In the indirect prediction method, because clouds disappear, translate and rotate in complex ways, research and applications focusing on ground-based cloud images are still relatively limited. Cloud cover is a non-negligible factor that causes fluctuations in PV output. To improve forecasting accuracy, it is therefore important to fully extract the features of ground-based cloud images and combine them with PV system operating parameters and numerical weather forecast information in a multi-information fusion model.

Irradiance Analysis in PV Power Station and Data Pre-processing

Analysis of the Variation Law of Solar Irradiance
The solar zenith angle describes the position of the sun relative to the local vertical. The global horizontal irradiance (GHI), the direct normal irradiance (DNI) and the diffuse horizontal irradiance (DHI) are related by

$$GHI = DNI \cdot \cos\theta + DHI$$

where θ is the solar zenith angle. GHI and DHI are not significantly affected by the change of cloud cover. However, when clouds appear, DNI fluctuates markedly. Therefore, the following content mainly predicts DNI in cloudy weather. In actual conditions, due to various meteorological and environmental factors, the true irradiance often differs considerably from the calculated value.

Figure 1 shows the daily exposure of the PV power plant of the US National Renewable Energy Laboratory for the seven years from 2011 to 2017 [7] . The exposure curve varies from year to year. Figure 2 shows the annual irradiance curve in hours, taking the year 2017 as an example. The two graphs show that the solar irradiance has an obvious time cycle, and the exposure changes periodically every year. The overall daily irradiance curve has a parabolic shape: the irradiance is lower in the morning and evening and highest at noon. Some of these curves fluctuate irregularly throughout the day, indicating that the weather is cloudy and the prediction is difficult.
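As a small numerical illustration, the standard relation GHI = DNI·cos θ + DHI between the three irradiance components can be evaluated directly. The sketch below is only an aid for the reader; the function name and the sample values are hypothetical and are not taken from the measured data.

```python
import math

def ghi_from_components(dni, dhi, zenith_deg):
    """Global horizontal irradiance from DNI, DHI and the solar zenith angle (degrees)."""
    if not 0.0 <= zenith_deg <= 90.0:
        raise ValueError("zenith angle must lie in [0, 90] degrees (sun above the horizon)")
    return dni * math.cos(math.radians(zenith_deg)) + dhi

# Hypothetical values in W/m^2: with the sun 60 degrees from the vertical,
# only half of the direct beam reaches a horizontal surface.
ghi = ghi_from_components(dni=800.0, dhi=100.0, zenith_deg=60.0)
print(round(ghi, 6))
```

Because a passing cloud mainly blocks the direct beam, the DNI term is the one that changes abruptly in this relation.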

Correlation analysis
In addition to geographic location, time and cloud cover, DNI is also related to many complex meteorological and environmental factors, so it is important to choose appropriate reference variables for irradiance prediction. Pearson correlation analysis is used to analyze the correlation between irradiance and six typical meteorological factors: temperature, humidity, pressure, wind speed, wind direction and cloud cover. Finally, the most relevant ones are selected as the inputs of the prediction model.
The Pearson correlation coefficient, also known as the product-moment correlation coefficient, is one of many statistical correlation coefficients. Assuming there are two data series X and Y, the Pearson correlation coefficient between the two variables is calculated as follows:

$$r = \frac{N\sum_{i=1}^{N} x_i y_i - \sum_{i=1}^{N} x_i \sum_{i=1}^{N} y_i}{\sqrt{N\sum_{i=1}^{N} x_i^2 - \left(\sum_{i=1}^{N} x_i\right)^2}\,\sqrt{N\sum_{i=1}^{N} y_i^2 - \left(\sum_{i=1}^{N} y_i\right)^2}}$$

where N is the number of data in each sequence. Pearson correlation analysis also defines the significance coefficient p. The definition of the significance coefficient p and the degree of correlation are shown in Table 1. The correlation is analyzed in SPSS, and the results are shown in Table 2. DNI is significantly correlated with temperature, relative humidity and total cloud cover at the 0.01 level, and is negatively correlated with humidity, pressure and cloud cover. It has no obvious correlation with the average wind direction. Comparing the absolute values of the correlation coefficients, DNI has the largest correlation with temperature, at 0.943, which is close to 1, followed by cloud cover, humidity, pressure and wind speed.
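The Pearson coefficient can also be computed outside SPSS. The following is a minimal sketch using only the Python standard library; the function name and the toy series are hypothetical and are not the measured data behind Table 2.

```python
import math

def pearson_r(x, y):
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    sx, sy = sum(x), sum(y)
    sxy = sum(a * b for a, b in zip(x, y))
    sxx = sum(a * a for a in x)
    syy = sum(b * b for b in y)
    num = n * sxy - sx * sy
    den = math.sqrt(n * sxx - sx ** 2) * math.sqrt(n * syy - sy ** 2)
    if den == 0:
        raise ValueError("correlation is undefined for a constant series")
    return num / den

# Hypothetical toy series: DNI rises linearly with temperature
# and falls linearly with cloud cover.
temperature = [10.0, 12.0, 14.0, 16.0, 18.0]
cloud_cover = [0.9, 0.7, 0.5, 0.3, 0.1]
dni = [200.0, 260.0, 320.0, 380.0, 440.0]

print(round(pearson_r(temperature, dni), 6))   # close to +1
print(round(pearson_r(cloud_cover, dni), 6))   # close to -1
```

The sign of r gives the direction of the relationship, and its absolute value is what is compared when ranking the impact factors.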

Data pre-processing and the selection of evaluation indexes
Since the input parameters have different dimensions, the input data need to be normalized so that all features are mapped to a uniform scale. Extreme value (min-max) normalization maps all parameters to the interval [0,1]. The calculation formula is:

$$x' = \frac{x - x_{min}}{x_{max} - x_{min}}$$

The ground-based cloud images used in this article are all real-time cloud images automatically collected by the EKO SPF-02 all-sky imager every minute. The image resolution can reach 1600×1200 pixels [7] . After calibration and correction to remove distortion, the normalized red-to-blue ratio NRBR is taken as the feature value for graying processing [8] , which converts the three-channel color image into a two-dimensional grayscale image. The processing formula is:

$$NRBR = \frac{B - R}{B + R}$$

where B is the blue channel value of the image, and R is the red channel value.
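The two pre-processing steps can be sketched in a few lines of Python. This is an illustrative sketch, not the authors' implementation: the function names are hypothetical, and the sign convention NRBR = (B − R)/(B + R) is an assumption, chosen so that blue sky gives large values and gray or white cloud gives values near zero.

```python
def min_max_normalize(values):
    """Map a sequence of numbers linearly onto [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]      # constant series: no spread to rescale
    return [(v - lo) / (hi - lo) for v in values]

def nrbr(r, b):
    """Normalized red-to-blue ratio of one pixel (assumed form: (B - R) / (B + R))."""
    total = b + r
    return 0.0 if total == 0 else (b - r) / total

# Hypothetical pixel values (8-bit channels)
print(min_max_normalize([10, 20, 30]))    # [0.0, 0.5, 1.0]
print(nrbr(r=50, b=150))                  # blue-sky-like pixel: 0.5
print(nrbr(r=200, b=200))                 # cloud-like pixel: 0.0
```

Applying `nrbr` to every pixel yields the single-channel image on which the threshold methods operate.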
In order to compare the prediction performance of different prediction models, we select the root mean square error RMSE and the average relative error MRE to measure the prediction accuracy. They are calculated as

$$RMSE = \sqrt{\frac{1}{n}\sum_{i=1}^{n} e_i^2}$$

$$MRE = \frac{1}{n}\sum_{i=1}^{n} RE_i$$

where e is the absolute error between the true value and the predicted value, RE is the relative error between the true value and the predicted value, and n is the total number of data.
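A minimal sketch of the two evaluation indexes follows. It assumes that the relative error of a sample is the absolute error divided by the true value, so samples with a true value of zero (for example at night) have to be excluded beforehand; the function names and numbers are hypothetical.

```python
import math

def rmse(true_vals, pred_vals):
    """Root mean square error between true and predicted values."""
    if len(true_vals) != len(pred_vals) or not true_vals:
        raise ValueError("inputs must be non-empty and of equal length")
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(true_vals, pred_vals)) / len(true_vals))

def mre(true_vals, pred_vals):
    """Mean relative error; assumes RE_i = |true_i - pred_i| / |true_i|."""
    if len(true_vals) != len(pred_vals) or not true_vals:
        raise ValueError("inputs must be non-empty and of equal length")
    if any(t == 0 for t in true_vals):
        raise ValueError("relative error is undefined when a true value is zero")
    return sum(abs(t - p) / abs(t) for t, p in zip(true_vals, pred_vals)) / len(true_vals)

# Hypothetical irradiance values in W/m^2
true_dni = [100.0, 200.0, 400.0]
pred_dni = [110.0, 190.0, 400.0]
print(round(rmse(true_dni, pred_dni), 4))   # 8.165
print(round(mre(true_dni, pred_dni), 4))    # 0.05
```

RMSE is expressed in the unit of the irradiance and penalizes large deviations, while MRE is dimensionless and can be read as a percentage.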

Traditional threshold method
The threshold method is a classic approach to image segmentation. The basic principle is to find an appropriate gray-scale threshold T and to examine the gray-scale value of each pixel in the image. Pixels whose gray-scale value is greater than the threshold are classified into one category, and the others into another. Methods for finding the threshold mainly include the fixed threshold method, the maximum information entropy threshold method and the minimum cross entropy threshold method [9] . The fixed threshold method segments images with a manually set threshold. In this article, the empirical value of 0.25 is selected as the gray threshold [10] . The maximum information entropy threshold method continuously adjusts the threshold to maximize the sum of the information entropy of the target and the background [11] , while the minimum cross entropy threshold method is based on the principle that the image intensity remains the same before and after image segmentation and reconstruction. In this method, the cross entropy is minimized by constantly adjusting the threshold [12] .
Suppose that L is the number of gray levels in an image, i is a gray value, h(i) is the number of pixels with gray level i, and T is the gray threshold. The probability of gray value i in the image is $p_i$, and the sum of the probabilities of the target pixels is $p_s$:

$$p_i = \frac{h(i)}{\sum_{j=0}^{L-1} h(j)}, \qquad p_s = \sum_{i=0}^{T} p_i$$

After the image is segmented, the gray-scale reconstruction is performed according to the following formula to obtain a binary (two-level) image:

$$g(x,y) = \begin{cases} \mu_A(T), & f(x,y) \le T \\ \mu_B(T), & f(x,y) > T \end{cases} \tag{7}$$

where f(x,y) is the gray value of the pixel at (x,y), and $\mu_A(T)$ and $\mu_B(T)$ are the mean gray values of the pixels at or below the threshold and above the threshold:

$$\mu_A(T) = \frac{\sum_{i=0}^{T} i\,h(i)}{\sum_{i=0}^{T} h(i)}, \qquad \mu_B(T) = \frac{\sum_{i=T+1}^{L-1} i\,h(i)}{\sum_{i=T+1}^{L-1} h(i)}$$

The information entropy of the target $H_A$ and the information entropy of the background $H_B$ are respectively

$$H_A(T) = -\sum_{i=0}^{T} \frac{p_i}{p_s}\ln\frac{p_i}{p_s}, \qquad H_B(T) = -\sum_{i=T+1}^{L-1} \frac{p_i}{1-p_s}\ln\frac{p_i}{1-p_s}$$

The threshold T of the maximum information entropy segmentation method is expressed as:

$$T_1 = \arg\max_{T}\left[H_A(T) + H_B(T)\right]$$

The cross entropy of the cloud image is

$$D(T) = \sum_{i=0}^{T} i\,h(i)\ln\frac{i}{\mu_A(T)} + \sum_{i=T+1}^{L-1} i\,h(i)\ln\frac{i}{\mu_B(T)}$$

where the term for i = 0 is taken as zero. The threshold T of the minimum cross entropy threshold method is expressed as:

$$T_2 = \arg\min_{T} D(T)$$

In order to reflect the recognition results more intuitively, we define the cloud amount as the ratio of the number of detected cloud cluster pixels to the total number of pixels in the effective area of the image. The calculation formula is:

$$CA = \frac{N_{cloud}}{N_{total}}$$

The three threshold methods are tested on the ground-based cloud images collected every 10 minutes from 7:00 to 16:30 on December 14, 2017. The cloud amount identified in each image is calculated and plotted as a recognition curve, as shown in Figure 3 (MIE = maximum information entropy, MCE = minimum cross entropy). The relative error curve between the real value and the recognition result is shown in Figure 4. Figure 5 shows the image taken at 13:00 and the binarized images produced by the three threshold methods. The white region is the cloud and the black region is the sky.

The average relative error over the whole day is 10.77% for the fixed threshold method, 21.84% for the maximum information entropy threshold method, and 18.55% for the minimum cross entropy threshold method. The curves show that the fixed threshold method gives better detection results when the cloud cover is either low or high. In cloudy weather, the thickness of the cloud layer is uneven, and there is no obvious difference between the gray value of thin cloud and that of the background. Because the fixed threshold cannot adjust automatically to the image characteristics, it performs worse in these cases.
We can also find that the cloud amount detected by the maximum information entropy threshold method is generally higher than the true value, while the cloud amount detected by the minimum cross entropy threshold method is generally lower than the true value.
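The two entropy-based threshold searches can be sketched on a gray-level histogram as follows. This is a hedged illustration rather than the authors' code: it assumes the commonly used maximum entropy formulation (sum of the entropies of the two normalized class distributions) and the commonly used minimum cross entropy formulation based on the class means, and all function names and the toy histogram are hypothetical.

```python
import math

def max_entropy_threshold(hist):
    """Threshold T that maximizes H_A + H_B (maximum information entropy)."""
    total = sum(hist)
    p = [h / total for h in hist]
    best_t, best_sum = None, -math.inf
    for t in range(len(hist) - 1):
        ps = sum(p[: t + 1])                      # probability mass of levels <= t
        if ps <= 0.0 or ps >= 1.0:
            continue                              # one class would be empty
        h_a = -sum(q / ps * math.log(q / ps) for q in p[: t + 1] if q > 0)
        h_b = -sum(q / (1 - ps) * math.log(q / (1 - ps)) for q in p[t + 1:] if q > 0)
        if h_a + h_b > best_sum:
            best_t, best_sum = t, h_a + h_b
    return best_t

def min_cross_entropy_threshold(hist):
    """Threshold T that minimizes the cross entropy D(T) (assumed class-mean form)."""
    levels = len(hist)
    best_t, best_d = None, math.inf
    for t in range(levels - 1):
        n_a, n_b = sum(hist[: t + 1]), sum(hist[t + 1:])
        if n_a == 0 or n_b == 0:
            continue                              # one class would be empty
        mu_a = sum(i * hist[i] for i in range(t + 1)) / n_a
        mu_b = sum(i * hist[i] for i in range(t + 1, levels)) / n_b
        d = 0.0
        for i in range(1, levels):                # level 0 contributes nothing
            if hist[i] > 0:
                d += i * hist[i] * math.log(i / (mu_a if i <= t else mu_b))
        if d < best_d:
            best_t, best_d = t, d
    return best_t

def cloud_amount(cloud_mask):
    """Ratio of cloud pixels to all pixels in the effective image area."""
    return sum(1 for is_cloud in cloud_mask if is_cloud) / len(cloud_mask)

# Hypothetical bimodal histogram with 8 gray levels: two groups separated by a gap
hist = [0, 10, 10, 0, 0, 0, 10, 10]
print(max_entropy_threshold(hist), min_cross_entropy_threshold(hist))
print(cloud_amount([True, False, True, True]))    # 0.75
```

On this toy histogram both searches place the threshold in the gap between the two groups. On real cloud images the two criteria generally give different thresholds, which is what the hybrid method in the next section exploits.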

Hybrid entropy threshold method
In order to achieve better recognition accuracy, the advantages of these three threshold methods are combined, and a new hybrid entropy threshold method is proposed. The procedure is as follows:
(1) Calculate the variance S of the gray-scaled ground-based cloud image. If the weather is sunny or overcast, the distribution of pixels in the image is balanced, so the variance is small. Conversely, the image variance is large in cloudy weather. Through repeated attempts, the threshold of the image variance Ts is finally set to 2200.
(2) For sunny or overcast cloud images, the fixed threshold method has a good recognition effect, and the threshold 0.25 is used for image segmentation.
(3) For a cloudy cloud image, the threshold is adjusted repeatedly within the interval [T2, T1] formed by the maximum information entropy threshold T1 and the minimum cross entropy threshold T2, and the quarter point close to T2, that is T = T2 + (T1 - T2)/4, is finally selected as the optimal threshold for image segmentation.
The variance statistics of the ground-based cloud images are shown in Figure 6, and the flowchart of the hybrid entropy threshold method is shown in Figure 7. The hybrid entropy threshold method is used to process the same cloud images as the traditional threshold method. The cloud amount is shown in Figure 8, and the relative error of each image is shown in Figure 9. The processed binarized image is shown in Figure 10. For comparison, the processing results of the traditional threshold method in Section 3.1 are also summarized in Table 3.

The analysis shows that the hybrid entropy threshold method achieves better recognition accuracy than the traditional threshold method. The average relative error throughout the day is only 5.23%, and the maximum relative error of a single image is only 14%. The new method is generally applicable to sunny, overcast and cloudy days within the error tolerance range. The binarized image shows that both thin and thick cloud clusters in the cloud image are well detected. The method can meet the requirements of real-time recognition of ground-based cloud images in different kinds of complex weather.
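The decision rule of the hybrid entropy threshold method can be summarized in a short sketch. It assumes that "small variance" means a variance below Ts, that the two entropy thresholds have already been computed, and that all thresholds are expressed on the same gray scale as the image; the function name and the sample data are hypothetical.

```python
from statistics import pvariance

def hybrid_entropy_threshold(gray, t_mie, t_mce, fixed_t=0.25, var_limit=2200.0):
    """Pick the segmentation threshold for one gray-scaled cloud image.

    gray      : flat sequence of gray values of the effective image area
    t_mie     : maximum information entropy threshold T1 (computed elsewhere)
    t_mce     : minimum cross entropy threshold T2 (computed elsewhere)
    fixed_t   : empirical fixed threshold for sunny or overcast images
    var_limit : variance threshold Ts separating sunny/overcast from cloudy images
    """
    if pvariance(gray) < var_limit:
        return fixed_t                            # sunny or overcast: fixed threshold
    return t_mce + (t_mie - t_mce) / 4.0          # cloudy: quarter point close to T2

# Hypothetical images: a uniform one (variance 0) and a strongly mixed one
uniform_image = [100] * 10
mixed_image = [0] * 5 + [255] * 5
print(hybrid_entropy_threshold(uniform_image, t_mie=120.0, t_mce=80.0))  # 0.25
print(hybrid_entropy_threshold(mixed_image, t_mie=120.0, t_mce=80.0))    # 90.0
```

In practice the fixed threshold and the entropy thresholds would have to be converted to a common scale before they are compared with pixel values; the mixed scales in the example only serve to make the two branches visible.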

BP Neural Network
The BP algorithm is the error back-propagation algorithm, which is mostly used to train multi-layer feedforward neural networks. Its basic structure is shown in Figure 11. A BP neural network propagates information unidirectionally. The output of the input layer is equal to its input, and the input of each hidden-layer and output-layer node is the weighted sum of the outputs of the previous layer. The degree of activation in each layer is determined by its activation function. Because the BP network has a strong nonlinear mapping ability, it is used here to build the prediction model of PV output. The most significant feature of the BP neural network is that the signal is propagated forward while the error is propagated backward. If the output layer does not produce the expected output, the error is propagated back and the weights and thresholds are adjusted according to the deviation, so that the prediction result continually approaches the expected output.
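The forward propagation of the signal and the backward propagation of the error can be illustrated with a very small network. The sketch below assumes one hidden layer with sigmoid units, a linear output unit and a squared-error loss; the class name, the initial weights and the toy data are hypothetical, and the actual models in this article were built in MATLAB.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

class TinyBP:
    """Three-layer network: inputs, sigmoid hidden units, one linear output."""

    def __init__(self, w1, b1, w2, b2):
        self.w1 = [row[:] for row in w1]          # hidden-by-input weights
        self.b1 = b1[:]                           # hidden thresholds (biases)
        self.w2 = w2[:]                           # hidden-to-output weights
        self.b2 = b2                              # output threshold (bias)

    def forward(self, x):
        """Signal propagates forward: weighted sums, then activations."""
        h = [sigmoid(sum(w * xi for w, xi in zip(row, x)) + b)
             for row, b in zip(self.w1, self.b1)]
        y = sum(w * hj for w, hj in zip(self.w2, h)) + self.b2
        return h, y

    def train_step(self, x, target, lr=0.01):
        """Error propagates backward and adjusts weights and thresholds."""
        h, y = self.forward(x)
        err = y - target                          # dE/dy for E = 0.5 * err^2
        # Hidden-layer error terms use the output weights before they are updated.
        delta_h = [err * w * hj * (1.0 - hj) for w, hj in zip(self.w2, h)]
        self.w2 = [w - lr * err * hj for w, hj in zip(self.w2, h)]
        self.b2 -= lr * err
        for j, row in enumerate(self.w1):
            self.w1[j] = [w - lr * delta_h[j] * xi for w, xi in zip(row, x)]
            self.b1[j] -= lr * delta_h[j]
        return 0.5 * err ** 2

def mean_squared_error(net, data):
    return sum((net.forward(x)[1] - t) ** 2 for x, t in data) / len(data)

# Hypothetical normalized samples: (inputs, target)
data = [([0.0, 0.0], 0.2), ([0.0, 1.0], 0.5), ([1.0, 0.0], 0.6), ([1.0, 1.0], 0.9)]
net = TinyBP(w1=[[0.1, -0.2], [0.3, 0.2], [-0.1, 0.4]], b1=[0.0, 0.0, 0.0],
             w2=[0.0, 0.0, 0.0], b2=0.0)
loss_before = mean_squared_error(net, data)
for _ in range(500):
    for x, t in data:
        net.train_step(x, t, lr=0.01)
loss_after = mean_squared_error(net, data)
print(loss_before > loss_after)
```

Repeating `train_step` over the training samples is what drives the output toward the expected values; the starting weights passed to the constructor are exactly the quantities whose choice affects convergence.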

BP neural network optimized by genetic algorithm
The selection of the initial thresholds and connection weights of the BP network has a decisive impact on the prediction results. If the initialization is unsuitable, the BP network tends to converge slowly and fall into local extrema. In order to find the optimal initialization parameters, a genetic algorithm can be used to optimize the traditional BP network. The genetic algorithm (GA) is a parallel random search method that simulates the genetic mechanism of nature and the theory of biological evolution. It constructs a limited number of similar individuals and selects among them from generation to generation according to the chosen fitness function. As a result, chromosomes with good fitness are retained, and those with poor fitness are eliminated. The optimization is repeated in cycles. The main operations of the optimization are as follows:
(1) Encode chromosomes during the initialization of the population, using the connection weights and thresholds of each layer to create a chromosome in the form of hybrid coding with binary and real numbers.
(2) Input the weights and thresholds into the BP network to calculate the error between the predicted result and the expected value. The fitness function F can take a variety of forms; we choose the form based on the sum of squared errors.
(4) Substitute the obtained optimal weights and thresholds into the BP network and train it with the training data.
Genetic algorithm optimization plays a role similar to back propagation in the original BP network, in that both automatically update the weights and thresholds. The difference is that the genetic algorithm searches for the best individual through evolutionary operations, so that the initial parameters are close to optimal. The BP network is then less likely to fall into a local minimum because of unreasonable initial values, and a better prediction effect is obtained.
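The evolutionary search for good initial parameters can be sketched as follows. This is a simplified illustration under several assumptions that are not taken from the article: chromosomes are purely real-coded (instead of the hybrid binary and real coding), the fitness is taken as 1/(1 + SSE), selection is by roulette wheel, crossover is arithmetic, mutation adds Gaussian noise, and the best individual is always kept (elitism). The function name and the toy error function are hypothetical; the population size and the crossover and mutation probabilities are the values used for the models in this article.

```python
import random

def ga_optimize(sse, n_genes, pop_size=40, generations=50,
                p_cross=0.3, p_mut=0.1, seed=0):
    """Search for a real-valued chromosome with a small sum of squared errors."""
    rng = random.Random(seed)

    def fitness(chrom):
        return 1.0 / (1.0 + sse(chrom))           # assumed fitness form

    pop = [[rng.uniform(-1.0, 1.0) for _ in range(n_genes)] for _ in range(pop_size)]
    best = max(pop, key=fitness)
    history = [sse(best)]
    for _ in range(generations):
        weights = [fitness(c) for c in pop]
        parents = rng.choices(pop, weights=weights, k=pop_size)   # roulette wheel
        children = []
        for i in range(0, pop_size, 2):
            a, b = parents[i][:], parents[(i + 1) % pop_size][:]
            if rng.random() < p_cross:            # arithmetic crossover
                alpha = rng.random()
                a, b = ([alpha * u + (1 - alpha) * v for u, v in zip(a, b)],
                        [alpha * v + (1 - alpha) * u for u, v in zip(a, b)])
            children += [a, b]
        children = children[:pop_size]
        for c in children:
            for g in range(n_genes):
                if rng.random() < p_mut:          # Gaussian mutation
                    c[g] += rng.gauss(0.0, 0.1)
        children[0] = best[:]                     # elitism: keep the best so far
        pop = children
        best = max(pop, key=fitness)
        history.append(sse(best))
    return best, history

# Hypothetical stand-in for the network error: distance to a known parameter vector.
target = [0.3, -0.2, 0.5]
def toy_sse(chrom):
    return sum((g - t) ** 2 for g, t in zip(chrom, target))

best, history = ga_optimize(toy_sse, n_genes=3)
print(history[0] >= history[-1])
```

For GA-BP, `sse` would decode a chromosome into the weights and thresholds of the BP network and return its prediction error on the training data, and the returned chromosome would then serve as the starting point for ordinary BP training.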

Network structure
In order to verify the effectiveness of genetic algorithm optimization and to show that the cloud amount extracted by the hybrid entropy threshold method can improve the accuracy of irradiance prediction, four prediction models are established for comparative analysis. Model 1 is a BP model without cloud image processing. Model 2 is a BP model including cloud image processing. Model 3 is a BP model optimized by genetic algorithm (GA-BP) without cloud image processing, and Model 4 is a GA-BP model including cloud image processing. In addition, the highly correlated impact factors identified by the correlation analysis are chosen as input parameters of these models. The four models all adopt a three-layer network structure, that is, there is only one hidden layer. Other network structure parameters include:
(1) Model input: For Model 1 and Model 3, the input parameters include the measured irradiance at t-2Δt, t-Δt and t, plus four other parameters: temperature, humidity, pressure and wind speed. For Model 2 and Model 4, the cloud cover at time t, detected by the hybrid entropy threshold method, is added as an input parameter, giving a total of eight input parameters.
(2) Model output: the predicted value of irradiance at time t+Δt.
(4) The learning rate of the BP network is 0.01. The size of the genetic population is 40. The crossover probability is 0.3 and the mutation probability is 0.1.
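The input and output layout described above can be assembled from the time series with a simple sliding window. The sketch below is only illustrative; the function name, the argument layout and the toy values are hypothetical, and the series are assumed to be already sampled at the interval Δt and normalized.

```python
def build_samples(irradiance, weather, cloud=None):
    """Build (input vector, target) pairs for one-step-ahead irradiance prediction.

    irradiance : measured irradiance at each time step
    weather    : per-step tuples (temperature, humidity, pressure, wind_speed)
    cloud      : optional per-step cloud amount; include it for Models 2 and 4
    """
    samples = []
    for t in range(2, len(irradiance) - 1):
        x = [irradiance[t - 2], irradiance[t - 1], irradiance[t], *weather[t]]
        if cloud is not None:
            x.append(cloud[t])                    # cloud amount at time t
        samples.append((x, irradiance[t + 1]))    # target: irradiance at t + dt
    return samples

# Hypothetical toy series with five time steps
irr = [1.0, 2.0, 3.0, 4.0, 5.0]
wx = [(10.0, 20.0, 30.0, 40.0)] * 5
cl = [0.1, 0.2, 0.3, 0.4, 0.5]

without_cloud = build_samples(irr, wx)            # layout of Models 1 and 3
with_cloud = build_samples(irr, wx, cl)           # layout of Models 2 and 4
print(len(without_cloud[0][0]), len(with_cloud[0][0]))   # 7 8
print(with_cloud[0])
```

Keeping the cloud amount as an optional last column makes it easy to train the models with and without cloud image processing on otherwise identical samples.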

Analysis of prediction results
Using these four prediction models combined with meteorological data and ground-based cloud images, the irradiance on December 14, 2017 is predicted at 10 min intervals. The weather on this day is cloudy at some times and sunny or overcast at others, which helps verify the versatility and flexibility of the models in various weather conditions. The models are implemented in MATLAB for training and testing, and the four irradiance prediction curves are drawn together with the true value curve.
For comparison, we have also tried several simple time series prediction models. The input parameters of these models only contain historical irradiance data and do not include any other impact factors. Among them, LSTM has the best performance [13] . Therefore, its prediction curve is also drawn, as shown in Figure 12. We use RMSE and MRE to evaluate the prediction results, as shown in Table 5.1. According to the curves in the figure and the statistical results in the table, the following conclusions can be drawn:
(1) Compared with the BP model, the RMSE and MRE of the GA-BP model are both reduced. This result verifies that the optimization of the initial parameters by the genetic algorithm has achieved the expected effect, and the prediction accuracy is improved.
(2) With the same algorithm, we can find that when the input parameters include the cloud amount, RMSE and MRE are smaller. This result shows that the extraction of cloud amount plays an important role in improving the accuracy of prediction. Irradiance variation is closely related to the cloud movement. The neural network can find the non-linear mapping relationship between cloud cover and irradiance to improve the prediction effect.
(3) The RMSE and MRE of the GA-BP model with cloud amount, temperature, humidity, pressure and wind speed are much smaller than those of the time series model LSTM, which shows that the choice of input parameters has a great impact on the prediction performance. Since irradiance is affected by many factors, adding highly correlated key impact factors can help improve the accuracy of the prediction.

Conclusion
With reference to the traditional time series prediction model, we established a multi-information fusion prediction model that uses a variety of meteorological information as well as cloud images. Pearson correlation analysis is used to choose the key impact factors with an obvious influence on the irradiance trend. Based on the traditional threshold method, a hybrid entropy threshold method is proposed to recognize cloud images with higher accuracy and to extract the cloud amount for irradiance prediction. A genetic algorithm is used to optimize the traditional BP network, which improves the prediction accuracy of the model. Overall, we have established an irradiance prediction model based on ground-based cloud images and multi-information fusion, which has important application value for PV output prediction and for improving the operational efficiency of the PV power industry.