Data verification for forecasting of building energy consumption

One of the necessary procedures prior to monitoring the energy consumption of buildings is the verification of the initial energy-consumption data. Verification should be applied whenever there is even the slightest doubt about the correctness of the initial data. The article deals with the verification of building energy-consumption data, which includes procedures for detecting and replacing zero, erroneous and absolutely equal values, as well as recovering lost data. To solve this problem, the article presents a number of methods implemented in the program "Statistical data verification", developed in the MATLAB environment. As an example, the verification of energy-consumption data for buildings comprising hospitals, polyclinics and rural health centres is presented.


Introduction
One of the necessary procedures in the management of building energy consumption [1,2] is the verification of the collected energy-consumption database. Verification should always be applied if there is even the slightest doubt about the correctness of the initial data. Verification is a logical-methodological procedure for detecting and replacing zero, erroneous and absolutely equal data, as well as recovering lost data. The initial data by no means always turn out to be correct, and incorrect data greatly reduce the reliability of such building energy-management procedures as balance modelling, forecasting [3,4] and rationing [5]. A preliminary analysis of the collected energy-consumption data for more than 4000 buildings (hospitals, polyclinics, rural health centres) has shown that the initial data used for calculations are not always correct: they contain zero or absolutely equal values, as well as so-called "outliers", which make the database incorrect. In some cases, it is also necessary to extend the existing database several years backwards. Consequently, a preliminary verification of the energy database is required. The main purpose of this work is to present methods for verifying building energy-consumption data, as well as the program "Statistical data verification", developed in MATLAB to solve this problem.

Methods
The main purpose of data verification is to correct the database obtained by collecting information on the energy consumption of social facilities. Data verification involves performing the following procedures: 1) elimination of zeros (null data); 2) elimination of outliers (erroneous data); 3) elimination of absolutely equal data; 4) recovery of lost data.

Zeros elimination
Zero data are the first sign that the database is incorrect: a really operated facility cannot consume a zero amount of energy. Zeros can appear in the database only because the corresponding readings were missing when information on individual facilities was collected. To eliminate zero data, they are first located; zeros found inside the database are then restored by interpolation, and those at the border of the database by extrapolation. Interpolation and extrapolation are performed in five ways: 1) splines; 2) third-degree polynomials; 3) a cascade-forward neural network with 10 neurons in the hidden layer; 4) a feed-forward neural network with 8 neurons in the hidden layer; 5) the mean of the four methods. Elimination of zero data can be performed on both annual and monthly data.
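The interior-versus-border logic described above can be sketched as follows. This is a minimal illustrative example in Python (the paper's program is written in MATLAB), using simple linear interpolation and nearest-value extrapolation as a stand-in for the five methods listed; the function name `replace_zeros` is hypothetical.

```python
def replace_zeros(series):
    """Replace interior zeros by linear interpolation between the nearest
    non-zero neighbours; zeros at the border of the series are filled by
    extrapolating the nearest non-zero value (a simplification of the
    paper's five interpolation/extrapolation methods)."""
    s = list(series)
    nz = [i for i, v in enumerate(s) if v != 0]  # indices of valid data
    if not nz:
        raise ValueError("series contains only zeros")
    for i in range(len(s)):
        if s[i] != 0:
            continue
        left = max((j for j in nz if j < i), default=None)
        right = min((j for j in nz if j > i), default=None)
        if left is None:        # border zero: extrapolate from the right
            s[i] = s[right]
        elif right is None:     # border zero: extrapolate from the left
            s[i] = s[left]
        else:                   # interior zero: linear interpolation
            t = (i - left) / (right - left)
            s[i] = s[left] + t * (s[right] - s[left])
    return s
```

For example, `replace_zeros([10, 0, 20])` fills the interior zero with the midpoint 15.0, while a zero at either end simply copies its nearest non-zero neighbour.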

Outliers elimination
Erroneous data (outliers) are atypical values that deviate significantly, without a valid explanation, from the centre of the distribution. Outliers appear due to gross errors in logging measurements, equipment failures, measuring in the wrong units, and so on. Another very common cause of outliers is an error made when entering data into the computer.
To check for and identify outliers, we used six statistical methods:
1) Z-score method. The Z-scores are defined as
$$Z_i = \frac{x_i - \bar{x}}{s},$$
where $\bar{x}$ is the sample mean and $s$ is the sample standard deviation. Data with $|Z_i| > 3$ are considered outliers [6], although the threshold may vary depending on the data set.
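A minimal Python sketch of the Z-score rule (illustrative only; the function name `zscore_outliers` is hypothetical and the threshold of 3 is the one cited above):

```python
def zscore_outliers(data, threshold=3.0):
    """Flag points whose Z-score |x_i - mean| / s exceeds the threshold,
    using the sample standard deviation s."""
    n = len(data)
    mean = sum(data) / n
    std = (sum((x - mean) ** 2 for x in data) / (n - 1)) ** 0.5
    return [abs(x - mean) / std > threshold for x in data]
```

Note that on very small samples the maximum attainable Z-score is bounded by $(n-1)/\sqrt{n}$, so a lone outlier may escape the threshold; this is one reason the article combines six methods by voting.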
2) Modified Z-score method. In this method $\bar{x}$ is replaced by the sample median $\tilde{x}$, and $s$ by the MAD (median of absolute deviations about the median). The modified Z-scores are defined as
$$M_i = \frac{0.6745\,(x_i - \tilde{x})}{\mathrm{MAD}}.$$
If $|M_i| > 3.5$, the data are considered outliers [7].
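The robust variant can be sketched in the same illustrative style (function names hypothetical; the constant 0.6745 and threshold 3.5 are those given above):

```python
def median(values):
    """Sample median (average of the two middle values for even n)."""
    s = sorted(values)
    n = len(s)
    m = n // 2
    return s[m] if n % 2 else (s[m - 1] + s[m]) / 2

def modified_zscore_outliers(data, threshold=3.5):
    """Flag points whose modified Z-score 0.6745*(x_i - median)/MAD
    exceeds the threshold."""
    med = median(data)
    mad = median([abs(x - med) for x in data])  # median absolute deviation
    return [abs(0.6745 * (x - med) / mad) > threshold for x in data]
```

Because the median and MAD are insensitive to a single extreme value, this test flags a lone gross error even in samples where the plain Z-score is masked.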
3) Interquartile range (IQR) method. The IQR is determined by the formula $IQR = Q_3 - Q_1$, where $Q_1$ is the first quartile and $Q_3$ is the third quartile.
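The IQR fences can be computed as in the following illustrative sketch (function name hypothetical; quartiles are estimated here by linear interpolation between order statistics, one of several common conventions):

```python
def iqr_bounds(data, k=1.5):
    """Return the fences (Q1 - k*IQR, Q3 + k*IQR); observations outside
    these bounds are treated as outliers."""
    s = sorted(data)
    n = len(s)

    def quartile(p):
        # linear interpolation between adjacent order statistics
        h = (n - 1) * p
        lo = int(h)
        return s[lo] + (h - lo) * (s[min(lo + 1, n - 1)] - s[lo])

    q1, q3 = quartile(0.25), quartile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr
```

For the sample 1..9 this gives $Q_1 = 3$, $Q_3 = 7$ and fences $(-3, 13)$ with the usual $k = 1.5$.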
According to [8], the probability that one or more observations will be considered outliers corresponds to $k \approx 2$; the widely used value is $k = 1.5$. Observations outside the bounds $[Q_1 - k\,IQR,\; Q_3 + k\,IQR]$ are considered outliers.
4) Adjusted IQR method. The idea of the method is to calculate the limits of the "confidence interval" taking the asymmetry of the distribution into account, while remaining equal to $1.5\,IQR$ in the symmetric case [9]. As the asymmetry coefficient we used the medcouple (MC), defined as
$$MC = \operatorname*{med}_{x_i \le \tilde{x} \le x_j} \frac{(x_j - \tilde{x}) - (\tilde{x} - x_i)}{x_j - x_i},$$
where $\tilde{x}$ is the sample median. The adjusted IQR bounds lie at [9]
$$[\,Q_1 - k\,e^{-4MC}\,IQR,\; Q_3 + k\,e^{3MC}\,IQR\,] \quad (MC \ge 0),$$
$$[\,Q_1 - k\,e^{-3MC}\,IQR,\; Q_3 + k\,e^{4MC}\,IQR\,] \quad (MC < 0),$$
with $k = 1.5$. Observations outside the adjusted IQR bounds are considered outliers.
5) Grubbs' test, also known as the maximum normed residual test [10]. The Grubbs test statistic is defined as
$$G = \frac{\max_i |x_i - \bar{x}|}{s},$$
where $\bar{x}$ and $s$ denote the sample mean and standard deviation, respectively. For the two-sided test, the hypothesis of no outliers is rejected if
$$G > \frac{n-1}{\sqrt{n}} \sqrt{\frac{t^2_{\alpha/(2n),\,n-2}}{n - 2 + t^2_{\alpha/(2n),\,n-2}}},$$
where $t_{\alpha/(2n),\,n-2}$ is the critical value of the t-distribution with $(n-2)$ degrees of freedom at a significance level of $\alpha/(2n)$.
6) Dixon's Q test. This test identifies whether the smallest or largest data value is an outlier and is used for samples of small size ($n \le 30$). The Q statistic for the largest value is
$$Q = \frac{x_n - x_{n-1}}{x_n - x_1},$$
where $x_1$ is the smallest observation, $x_n$ the largest and $x_{n-1}$ the second largest. If $Q > Q_{crit}$, the corresponding data point is rejected as an outlier.
A data point is considered an outlier if it is flagged by at least five of the six methods. Outliers are then eliminated in the same way as zeros, i.e. by the five interpolation and extrapolation methods presented above.
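The five-of-six voting rule that combines the methods above can be sketched as follows (illustrative Python; the function name `vote_outliers` is hypothetical):

```python
def vote_outliers(flag_lists, min_votes=5):
    """Combine per-method boolean outlier flags: a point is declared an
    outlier only if at least min_votes of the methods agree (the article
    uses 5 of 6). Each inner list holds one method's flags, point by point."""
    return [sum(flags) >= min_votes for flags in zip(*flag_lists)]
```

Requiring near-unanimous agreement makes the combined detector conservative: a point flagged by only one or two tests (for example, by the plain Z-score alone on a skewed distribution) is kept in the database.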

Absolutely equal data elimination
The presence of absolutely equal data is also a sign that the database is incorrect, which is obvious even in terms of physical sense: it is very difficult to find two different facilities, remote from each other and with different modes of operation, that have absolutely equal energy consumption. To eliminate absolutely equal data, they are first searched for; if present, they are made distinct by adding relatively small values (Δ = 0.1) to the original data.
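One simple way to implement this tie-breaking is shown below (an illustrative sketch; the function name `break_ties` is hypothetical, and Δ = 0.1 is the value used in the article):

```python
def break_ties(data, delta=0.1):
    """Make exactly equal values distinct by adding k*delta to the k-th
    repeat of each value (first occurrence is left unchanged)."""
    seen = {}
    out = []
    for v in data:
        k = seen.get(v, 0)       # how many times v has appeared so far
        out.append(v + k * delta)
        seen[v] = k + 1
    return out
```

For example, three buildings reporting exactly 50 become 50.0, 50.1 and 50.2. A production implementation would also check that the shifted values do not collide with other existing entries.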

Recovery lost data
Recovering lost data is optional among these four procedures; it should be performed when data on individual facilities are missing. The lost part of the database can be recovered by extrapolation for each facility. This is essentially a forecasting task in reverse, where present data are used to determine past data. The data are extrapolated by the five methods given above, and the number of years for which the database should be restored must be specified.
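The "forecasting in reverse" idea can be illustrated with a least-squares linear trend fitted to the known years and evaluated at negative indices (a deliberately simplified stand-in for the article's splines, cubic polynomials and neural networks; the function name is hypothetical):

```python
def extrapolate_backwards(series, n_back):
    """Estimate n_back earlier values of a yearly series by fitting a
    least-squares straight line to the known values (indexed 0..n-1)
    and evaluating it at indices -n_back..-1."""
    n = len(series)
    xs = list(range(n))
    mx = sum(xs) / n
    my = sum(series) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, series))
    slope = sxy / sxx
    intercept = my - slope * mx
    return [intercept + slope * x for x in range(-n_back, 0)]
```

For a series growing by 2 units per year, `extrapolate_backwards([10, 12, 14, 16], 2)` returns the two preceding values 6.0 and 8.0.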

Results
The implementation of the data verification procedures with the program "Statistical data verification" is illustrated by the example of the electricity consumption of 150 buildings (hospitals, polyclinics, rural health centres) located in the Tashkent region.

Zeros elimination
As noted above, the first step is to check the source database for zero data. The check revealed 28 zero values in five buildings. For example, the data of building No. 6 contain 2 zero values (Fig. 1). The zero data are replaced by a forward-propagation neural network with 10 neurons in the hidden layer.

Outliers elimination
The check identified 68 erroneous values in 61 facilities. For example, the data of building No. 9 contain 3 outliers (Fig. 2). The outliers are replaced by a forward-propagation neural network with 10 neurons in the hidden layer.

Absolutely equal data elimination
The check identified 19 absolutely equal values (Fig. 3). They are eliminated by adding relatively small values (Δ = 0.1).

Recovery lost data
In this database, recovery should be performed for building No. 4 (Fig. 4): the data for the first year need to be restored. The distribution of the annual value over the months is performed randomly.
The quality of the forecasting models is assessed by comparing the actual values with the values obtained from the ANN model. The results of assessing the impact of the verification procedures on the quality of the energy prediction models for six randomly selected buildings are presented in Table 1.

Conclusions
The initial data do not always turn out to be correct. They contain zero, erroneous (outlier) and absolutely equal values, which greatly reduces the quality of the models used to predict the energy consumption of buildings; in some cases, lost data must also be restored. Consequently, the existing building energy-consumption database must be verified before forecasting is performed. Verification improves the quality of the building energy-forecasting models: replacing only zero data improved model quality on average 9-fold; replacing zero data and outliers, on average 28-fold; and replacing zero, outlier and absolutely equal data, on average 27.7-fold.
Fig. 4. Recovery of the lost part of the database

Fig. 3. Search and replacement of absolutely equal data

Table 1. The results of assessments of the impact of verification procedures