Forecasting of Three Components of Solar irradiation for Building Applications

. Solar energy and the concept of passive architecture and Net Zero Energy buildings are being increased. For an optimal management of the building energy, a Model Predictive Control is generally used but requires an accurate building model and weather forecast. For a more reliable modelling, the knowledge of the global solar irradiation is not sufficient; three methods, smart persistence, artificial neural network and random forest, are compared to forecast the three components of solar irradiation measured on the site with a high meteorological variability. Hourly solar irradiations are forecasted for time horizons from h+1 to h+6. The random forest method (RF) is the most efficient and the accuracy of forecasts are in term of nRMSE, from 19.65% for h+1 to 27.78% for h+6 for global horizontal irradiation, from 34.11% for h+1 to 49.08% for h+6 for beam normal irradiation, from 35.08% for h+1 to 49.14% for h+6 for diffuse horizontal irradiation. The improvement brought by the use of RF compared to the two other methods increases with the forecasting horizon. A seasonal study is realized and shows that the forecasting during spring and autumn is less reliable than during winter and summer due to a higher meteorological variability .


Introduction
There has been a rapid increase in solar passive architectural buildings in the last few years to achieve the net-zero energy concept. Due to the dynamic change in solar radiation, reliable energy generation forecasting is necessary for grid operation in the case of solar energy generation and also for passive solar architectural building design for the optimal thermal performance of buildings [1]. Net Zero Energy Buildings are at the frontier of the energy efficiency and renewable energy sources integration in buildings. A proper design of these buildings, taking into account their connectivity, is a key stone to reach them. Once designed their operation requires of an optimal control system called Model Predictive Control (MPC) [2]. MPC requires a model of the system, real-time controllers and weather forecasts. A lot of attention has been paid to the two first: obtaining good models for buildings and checking the robustness of the control scheme adopted. However, less attention has been paid to the weather forecasting requirements for such applications [2]. MPC is a control that employs an explicit model of the system to be controlled which is used to predict the future output behaviour. This prediction capability allows solving optimal control problems on line, where tracking error, namely the difference between the predicted output and the desired reference, is minimized over a future horizon, possibly subject to constraints on the manipulated inputs and outputs. The result of the optimization is applied according to a receding horizon philosophy: At time t only the first input of the optimal command sequence is actually applied to the system. The remaining optimal inputs are discarded, and a new optimal control problem is solved at time t+ 1. There is extensive literature covering this field [3][4]. Different solar systems require different solar forecasts. For solar concentrating systems the normal beam incident irradiance must be forecast, whereas for nonconcentrating systems primarily the global irradiance on a tilted surface is required. Efficient and renewable thermal comfort management inside buildings can be considered as a non-concentrating solar system.. For a more reliable modelling, the knowledge of the horizontal global solar irradiation is not sufficient; it is necessary to know the main solar components (normal beam and horizontal diffuse) mainly for two reasons: 1. To know the solar irradiation incident on each surface whatever the orientation and inclination are; the global irradiation cannot be measured in each building area and the received solar irradiation can be computed from beam, diffuse and global components; 2. To calculate more accurately the heat exchanges (diffuse and beam solar irradiance being different influences). Don't forget that the forecasting tool can be used for estimating the future thermal but also electrical production (by photovoltaic systems). The choice of the forecasting methods depends on the forecasting time horizon and the temporal resolution as seen in Fig. 1.   Fig. 1. Forecasting time horizon versus temporal resolution [5].
The existing methods can be classified in four different sets [5][6]: • Time series based methods: this set holds approaches based on statistical models solely ground on past measurements; • NWP (Numerical Weather Forecast) based methods: this set holds approaches based on weather forecasts provided by a specialized provider; • Satellite imagery based methods: this set holds approaches based on images of the earth taken by satellite; • Sky image based methods: this set holds approaches based on observations of cloud cover from the ground with an in situ camera.
The objective of this paper being to forecast hourly solar irradiation from 1 to 6 hours ahead, we concentrated our attention on methods developed for now-casting. The most efficient existing methods for such a forecast for these time horizons are time series analysis, artificial intelligence methods and deep learning methods [7]. If forecasting methods are largely developed for global solar irradiation (GHI), those developed for Beam Normal Irradiation (BNI) are few in number and those for Diffuse Horizontal Irradiation component (DHI) are, to the best of our knowledge, practically non-existent.

Cleaning and filtering
An automatic quality control used in the frame of GEOSS project (Group on Earth Observation System of System) [8] is applied to the solar data. Before introducing the solar data into the machine learning process, the data must be cleaned and filtered. The data are filter out in order to remove night hours. The data near sunset and sunrise are generally not reliable (instrumental errors especially due to the cosine response and mask effect of the surrounding mountains), a preprocessing operation is applied based on the solar elevation: solar radiation data with a solar elevation is lower than 10° are removed [9]. 3 years of hourly data have been used in this study. After cleaning and filtering the total number of hourly data for each solar component (global, beam and diffuse) is 10559 (about 60% of the data were not used (2% for outliers data and 58% for sun height less than 10°).

Stationarization
Machine learning methods are efficient tools for forecasting time series with a stationary behavior. To make solar irradiation data stationary and to separate the climatic effect and the seasonal effects, the solar data are generally transformed in non-dimensional variables called "clearness index", and denoted kt, given by the ratio of the solar radiation on the earth to that outside the atmosphere and defined by equation (1) [10]: (1) with GHI the global irradiation at the earth's horizontal surface for a given location and G0 the global solar radiation on the top of atmosphere. It is the clearness index series kt that induces randomness, caused by the diversity of atmospheric components (dust, aerosols, clouds motion, and humidity) on the solar irradiation measured at earth 'surface. The extraterrestrial irradiation G0 can be efficiently replaced by the clear sky solar irradiation GCS taking into account the climatic conditions of the meteorological site; thus the clearness index is replaced by the clear sky index kg,cs defined by: with GHICS the Global Horizontal solar Irradiation in clear sky conditions. For the other components of solar radiation (direct and diffuse), similar index can be defined such as kb [11] and kd [12]:  [13] and simplified by Ineichen [14], the European Solar Radiation Atlas (ESRA) model [15] and the Reference Evaluation on Solar Transmittance 2 (REST2) model [16]. Thus, we decided to use the simplified Solis clear sky model [14], it allows to calculate GHICS, DNICS and to deduce DHICS by Eq. (4).

GHICS=DHICS+BNICS x cos z
is the zenithal angle calculated at the middle of the hour [10]. For a better accuracy, the monthly mean values of aerosol optical depth and water vapour column are used as inputs in the Solis model; these averaged values were calculated from data measured at the Pic du Midi site (180 km from Odeillo, our study site presented below and same altitude) available in the data basis AERONET (AErosol Robotic NETwork) for four years (2001)(2002)(2003)(2004) [17]. This clear sky model was validated for each month by comparison with experimental solar radiation data (GHI, BNI, DHI) measured in clear sky conditions. For illustration purpose, experimental and modelled solar irradiances by clear sky are plotted in Fig. 2 for 1 day in April and in September. A good concordance is noted between modelled and experimental curves; the diffuse solar irradiance by clear sky is always lower to the irradiance by partially or cloudy skies because this component is minimum when the sky is clear and maximum in cloudy conditions.

Choice of input data
The purpose of this paper is to predict the future solar irradiation (at different time horizons) based on the past observed data i.e. mathematically: A variable X with the symbol represents a forecasted data, without this symbol, X is a measured data. The solar data at future time step (t+h) X _(t+h) is forecasted from the observed data X measured at the times (t, t-1…, t-n); thus, the objective is to determine the value of n i.e. the dimension of the input matrix; to do it, an auto mutual information method [18] is used. The auto mutual information is a property of the time series, depends on each dataset and is characteristic to the degree of statistical dependence between Xt+h and Xt-i with 0≤i≤n. Another preprocess called k-fold sampling is used with the dataset [19]: it consists in dividing randomly the data set into a training data set (80%) and a test data set (20%); this process is repeated k times and the value of the reliability metrics given in this paper are the average value on the k-fold. Here k is taken equal to 10. Thus, the results are independent of the set of data used for the training because using only one data set (with its own statistical particularities) can reduce the robustness of the conclusions.

The meteorological site
The meteorological data GHI, BNI and DHI were provided by the PROMES laboratory (CNRS UPR 8521), located in south of France at Odeillo (Pyrénées Orientales, France, 42°29 N, 2°01 E, 1550 m asl), the station is located in the mountains, at about 100 km from the Mediterranean sea and presents often a high nebulosity (Fig. 3). The solar data are measured and stored with a 1 minute time interval of measurement. This meteorological station being situated in altitude, the climate is very perturbed, the rainfall continues to be present during the driest months and this station is classified according to the Köppen-Geiger classification in Cfb (i.e. hot temperate climate without dry period and temperate summer). Consequently, the variability of the solar radiation is high and its forecasting is all the more difficult to realize. This variability of the solar irradiation can be quantified thanks to various parameters, Voyant et al [20] tested some of them and deduced that the more significant was the mean absolute log return; the mean absolute log return was calculated for Odeillo and are, for the three components: -0.6109 for GHI -0.9945 for BNI -0.4732 for DHI It appears that the variability of BNI is higher than for GHI and DHI and should be more difficult to predict.

Brief description of the forecasting methods
Each forecasting methodology is described shortly in this paragraph: Scaled (or Smart) persistence, Artificial Neural Network and Random Forest. The first method uses a naïve model, easy to implement and requiring no training step i.e. no historical data set; it is generally used as a reference model in view to compare it with more sophisticated models in terms of accuracy. The second and third models belong to the family of machine learning methods, more complex to implement but generally more reliable too.

Smart-Persistence
The persistence model, the simplest forecasting model, assumes that the future value is same as the previous one (Eq. (6)). Persistence forecast accuracy decreases significantly with forecasting horizon [21].
with X=GHI, BNI or DHI (6) The smart persistence is an improved version of the persistence one taking into account the diurnal solar cycle: the clear sky solar radiation profile over the day is used [20]: with X=GHI, BNI or DHI (7) This smart persistence model is applied in this paper to the three solar components and used mainly as a reference model.

Artificial Neural Network (ANN): Multi-layer Perceptron
It is the more known and used machine learning method for forecasting purposes. ANN is a nonlinear approximator implementing a simple pattern of elements interconnected one another. The ANN used here is the Multilayer Perceptron (MLP) with feed-forward back propagation often used in solar forecasting estimation and prediction [22][23]. The hidden layer receives input data and send an output signal to the output layer. A neuron receives signals from other previous neurons or input data unidirectionally for a feed-forward MLP configuration (Fig 4).
For the k-th neuron of the hidden layer, a weight wk,j taken various values determined during the training phase, is linked to each input xj; An activation function f is applied to the weighted sum ( ) for calculating an output if this sum exceeds a given bias . This output is then distributed to other next neurons. A sigmoid function for hidden layers and a linear one for the output layer were taken as activation functions. For the regression of the time series Xt, the mathematical expression for a MLP with one hidden layer of m neurons, one output neuron and n input variables is a function described by: With X the input vector (n x 1) of clear sky index kg,cs,kb,cs or kd,cs, the output value corresponding to the predicted values of the model at horizon t+h, bk and bo the bias related to the hidden neuron k and to the output, and ωk,j the weights between the j-th measured input and the k-th hidden neuron. f is the transfer function of the hidden neurons, ω*k the weight between the output and k-th hidden neurons. The optimization of the MLP is made by the Levenberg-Marquardt learning algorithm: several configurations with a different number of hidden nodes in the hidden layer are tested (the number of hidden nodes varying between 3 and n+2) and the most efficient is selected. Once the clear sky index is forecasted , the value of the forecasted solar irradiation, ( or is obtained in multiplying by the calculated clear sky irradiation ( ( or
The binary RT method consists in an iterative split of the data into two groups according to some thresholds and rules [26]; it constructs a set of decision rules on the predictor variables [27] in view to partition the data into smaller groups with binary splits based on a single predictor variable [25]. For RT, the predictor and the threshold or grouping are chosen for maximizing the homogeneity of the corresponding values in the resulting groups. The homogeneity is calculated as the sum of variance of data within each groups, this variance being minimized [26]. Each group is then divided in two subgroups and so on. For each final group (called a leaf), the predicted value is the mean of the values belonging to the leaf. The procedure grows maximal trees and then techniques such as cross validation are used to prune the overfitted tree to an optimal size [28]. It appeared that the output error obtained by a single RT is due to the specific choice of the training data set [25]. Thus, for solving this problem, Breiman [29] proposed to grow several trees and to average their predicted values to yield a more stable final prediction. To avoid having to use too much data for creating several independent trees, samples of data are chosen randomly in the original data set. This method is called bagging (contraction of bootstrap aggregating). The complexity of the model is tuned with the number of bagged trees, and each individual tree is not pruned [25]. The bagged trees are not statistically independent and the variance of their mean cannot be indefinitely decreased [25] because they are built from the same data set. To reduce this problem, Breiman [30] added a randomization step to bagging, each split of each bagged RT is built in a random subset of the predictors [26]. Numerous trees are growing creating a random forest (RF). Each subset of data used to grow the tree is replaced in the dataset before growing the next tree. This randomization gives more robustness to the model and decreases the risk of overfitting. At the end, the responses are aggregated to make the forecast. For more precisions about the regression trees based models, the reader can refer to the references given in this section. A comparison of these three methods (RT, Bagged RT and RF) is given in Prasad et al [25] in term of strengths and limitations. RF is recognized as one of the most effective machine learning models for forecasting and will be used in this paper.

Statistical index for accuracy evaluation
In this paper, we use four error metrics: -The mean absolute error (MAE) is appropriate for applications with linear cost functions, i.e. situations where the costs resulting from a poor forecast are proportional to the forecast error: -The root mean square error (RMSE) is more sensitive to important forecast errors, and hence is suitable for applications where small errors are more tolerable and larger errors lead to costs that are disproportionate, as in the case of utility applications, for example. It is probably the reliability factor that is the most widely used: These errors are then normalized, the mean value of irradiation is generally used as reference:

Auto-mutual analysis results
The choice of the number of endogenous inputs was realized by the auto-mutual information method which determines the number of previous solar irradiation data used to predict the future solar irradiation at time h+1 to h+6. The auto-mutual method showed that the number of inputs (n in equation 5) for predicting GHI is 6, for BNI 7 and for DHI 10. Table 1 gives the values of the performance metrics calculated on the test data set (RMSE and MAE are given in Wh.m -2 ) for GHI. As the ranking of the model is almost always identical from a RMSE point of view or MAE point of view, (excepted for h+1) we only present in Fig. 5 the results in term of RMSE and nRMSE expressed in percentage. The smart persistence, a naive model, was used as a reference, this model has a good RMSE and MAE for a time horizon h+1 but its performances decrease rapidly with the time horizon. We note that the gap in term of performances between ANN and RF increases with the time horizon, from 2.92% at h+1 to 7.07 % at h+6 in nRMSE value. Table 2 gives the values of the performance metrics computed on the test data set (RMSE and MAE are given in Wh.m -2 ) for BNI. The results in term of RMSE and nRMSE are presented in Fig. 6 for BNI. The forecasting of BNI is more difficult and the performances of the models are less satisfying than with GHI, this is because BNI is more sensitive to meteorological conditions and because the beam radiation intensity is more rapid and of a greater magnitude as suggested by the higher value of the mean absolute log return characterizing the intermittency degree. One more time, RF is the most performant model whatever the time horizon is and the gap in term of nRMSE between ANN and RF passes from 4.11% at h+1 to 12.8% at h+6 justifying even more the use of RF for BNI forecasting than for GHI one.  Fig. 6. Comparison of forecasting models for various horizon in term of nRMSE (left) and RMSE (right) for hourly BNI. Table 3 gives the values of the performance metrics calculated on the test data set (RMSE and MAE are given in Wh.m -2 ) for DHI. The results in term of RMSE and nRMSE are presented in Fig. 7 for DHI.

Diffuse Horizontal Solar Irradiation: DHI
The metrics values are of the same order of magnitude as for BNI excepted for the smart persistence. The smart persistence presents bad results because the daily profile of the DHI by clear sky (taken into account in this model) is not as well defined as for BNI or GHI. As seen for the two other components, the gap in term of performances between RF and ANN increases with the forecasting time horizon from 5.91% for h+1 to 14.74% for h+6. The use of a forecaster using random forest method for the DHI component gives very correct results.  5 58.0 63.1 66.0 67.6 68.1  nRMSE 0.351 0.419 0.456 0.477 0.488 0.491  MAE  33.6 41.0 44.7 47.7 48.9 50.1  nMAE 0.243 0.296 0.323 0.344 0.353 0.357   Fig. 7. Comparison of forecasting models for various horizon in term of nRMSE (left) and RMSE (right) for hourly DHI.

Comparison
It is impossible to compare the performances of the models according to the solar component in term of absolute value of RMSE (or MAE) because the daily DHI, BNI and GHI are very different. Thus, we plotted in Fig. 8, a comparison of the performances in term of relative RMSE for the three models. For each forecasting horizon, the three first histograms correspond to the smart persistence, then ANN, then RF. As previously underlined, GHI is forecasted with a better accuracy compared with BNI and DHI. It is probably due to the fact that in GHI, the two components, DHI and BNI, have compensating effects (when diffuse increases, beam decreases) and the speed of variation of GHI is less important than for DNI. Concerning the performances of the forecasters for GHI, the relative low reliability is probably due to clear sky index which is higher than 1 and can reach high values. The accuracy obtained for BNI and DHI using ANN and RF are of the same order of magnitude; in contrast, the smart persistence is not adapted at all for forecasting DHI for the reason previously explained. With SP and ANN methods, DHI and BNI are predicted with a nRMSE nearly twice as high than for GHI, with random forest method this difference is reduced when the forecasted horizon increases and for (h+6) the accuracy obtained for DHI and BNI prediction is the same than for GHI prediction. It seems that random forest have for all these reasons is the best predictor.

Seasonal performances
It is interesting to observe the performances of the model season by season and thus to observe the impact of the meteorological conditions (and of its variability) on the accuracy of each model. As it is difficult to compare the performances of the models according to the season in term of absolute value of RMSE (or MAE) because according to the season the average hourly irradiation are different, only nRMSE values are given in Tables and  Figures.

Global Horizontal Solar Irradiation: GHI
The performance metrics for GHI are presented in Table  4. and in Fig 9. Whatever the forecasting horizon and the model are, the best results are obtained for summer then for winter; in summer, the occurrence of clear sky irradiations is higher and in winter, the occurrence of cloudy ones too; the solar irradiation during intermediate seasons are more difficult to forecast. In spring and secondly in autumn, the difference in performance term between the three models is the highest, thus when the variability of the GHI is high, the utilization of a more complex forecasting tool is necessary. Fig. 9. Comparison of forecasting models performances for various seasons in term of nRMSE for hourly GHI.

Beam Normal Solar Irradiation: BNI
The nRMSE for BNI are presented in Table 5 and Fig.  10. The same remarks can be made than for GHI concerning the more important difficulty to forecast BNI in spring and autumn. The gap between the worst and the best model is higher than in GHI case and it appears clearly that SP is not adapted to a BNI forecasting. Fig. 10. Comparison of forecasting models performances for various seasons in term of nRMSE for hourly BNI.

Diffuse Horizontal Solar Irradiation: DHI
The nRMSE for DHI are presented in Table 6 and Fig.  11. As noted previously, the smart persistence is really a bad predictor for DHI, it is probably due to the fact that DHI by clear sky, DHICS, is lower than DHI by cloudy sky and thus kd,CS is higher than 1 and can vary in a large range, perturbing this method application. For the other methods, the accuracy of the prediction stays in a correct range. Fig. 11. Comparison of forecasting models performances for various seasons in term of nRMSE for hourly DHI.

Conclusion
Three forecasting methods, smart persistence, artificial neural network (multilayer Perceptron) and random forest, were compared and tested on solar data measured in a meteorological site presenting a high variability. The objective was to predict the hourly solar irradiation for a time horizon from h+1 to h+6; these methods were applied on the three solar components: horizontal global, normal beam and horizontal diffuse. It appears that random forest method allows to predict these three components with a good accuracy: -nRMSE from 19.65% for h+1 to 27.78% for h+6 for GHI; -nRMSE from 34.11% for h+1 to 49.08% to h+6 for BNI; -nRMSE from 35.08% for h+1 to 49.14% for h+6 for DHI. The random forest method gives the best results and the improvement due to the utilization of RF in comparison of ANN is even more important that the forecasting horizon increases; the improvement in term of nRMSE (nRMSERF-nRMSESC) due to a RF use compared to a SP use is: -For GHI forecasting, +2.02% for h+1 to +17.13% for h+6; -For BNI, +3.3% for h+1 to 28.36 for h+6; -For DHI, +28.56% for h+1 to 34.13% to h+6. A seasonal study was realized and showed that the forecasting during spring and autumn is more difficult to realize than during winter and summer due to a higher variability of the climate on these periods. BNI and DHI are more complicated to predict than GHI: the BNI and DHI components are more sensitive to meteorological conditions than GHI one (for GHI, the two components, DHI and BNI, have compensating and smoothing effects (when diffuse increases, beam decreases) and the variability of BNI is more important with a higher speed of variation and a higher amplitude. For DHI, the fact that the clearness index is higher than 1 and can reach high values (contrary to BNI and GHI with a clearness index between 0 and 1) perturbs the forecasting process.