Forecasting residential gas consumption with machine learning algorithms on weather data

- Machine learning models have proven to be reliable methods for forecasting energy use in commercial and office buildings. However, little research has been done on energy forecasting in dwellings, mainly due to the difficulty of obtaining household-level data while safeguarding the privacy of inhabitants. Insight into near-future energy consumption can help with balancing the grid and can reveal how energy consumption might be reduced. In collaboration with OPSCHALER, a measurement campaign on the influence of housing characteristics on energy costs and comfort, several machine learning models were compared on forecasting performance and the computational time needed. Nine months of data containing the mean gas consumption of 52 dwellings at a one hour resolution was used for this research. The first 6 months were used for training, whereas the last 3 months were used to evaluate the models. The results showed that the Deep Neural Network (DNN) performed best, with a 50.1 % Mean Absolute Percentage Error (MAPE) at a one hour resolution. At daily and weekly resolutions, the Multivariate Linear Regression (MVLR) outperformed the other models, with a 20.1 % and 17.0 % MAPE, respectively. The models were programmed in Python.


Introduction
In recent years the European Commission has set ambitious CO2 emission reduction targets. As a consequence, the Dutch government has been increasingly regulating the energy sector. The Netherlands aims to be free of natural gas use by the end of 2050 [1]. Therefore, the importance of diminishing natural gas consumption in dwellings is growing.
Gaining insight into the prediction of gas consumption in dwellings is critical to meet the Dutch government's requirements. However, the vast majority of research has been directed at energy consumption prediction in commercial and office buildings [2, 3, 4, 5, 6]. Thus far, little research has been conducted on predicting gas consumption at the single household level, due to the difficulty of obtaining energy use data while preserving data privacy [7]. From research on energy forecasting in commercial and office buildings, Alberto Hernandez Neto [3] concluded that deep neural networks outperform physical simulation models by 3-6 %. Furthermore, according to Jurado López [8], weather conditions have the highest correlation with gas use. It is worth noting that using multiple weather parameters as features in prediction models improves the accuracy of the model [3]. This paper focuses on comparing the accuracy of widely discussed machine learning algorithms at hourly, daily and weekly resolutions and, ultimately, on creating a model that can predict the natural gas use of dwellings as accurately as possible. Different models were built and their results were evaluated. This was done with a dataset containing nine months of mean hourly gas consumption data from a total of 52 dwellings in the Netherlands, together with the mean hourly weather data of a nearby weather station. In this paper the models used as few features as possible, to minimize the required computational power, which enables the models to run on computers with limited computing power in dwellings.

Methodology
This paper focuses on the evaluation of different machine learning models, specifically the Long Short-Term Memory (LSTM), Gated Recurrent Unit (GRU) and Convolutional Neural Network (CNN). These deep learning algorithms have shown the best potential for this type of prediction [1,5,6,9,4]. In most of the previously listed references, the models were aimed at electricity prediction for commercial and office buildings. These algorithms are called complex models in the literature. The accuracy of the models depends on the environment of the research and the chosen features. Consequently, the models need to be verified on residential households, because this environment differs from that of commercial buildings: commercial buildings usually have a fixed energy use time schedule, while dwellings show more variation in their energy use patterns [8].
In addition to the complex models, Multivariate Linear Regression (MVLR) and the Deep Neural Network (DNN) have shown promising results given their simplicity [5,10]. These models are called simple models in the literature. The different models are discussed in their respective subchapters.
This section discusses how the data was collected and processed, which machine learning methods were applied and which metrics were used to evaluate the models. The same metrics were used to evaluate all models. Furthermore, this section clarifies the environmental settings common to all models. All models were programmed in Python.

Data collecting and processing
The data used in this research was gathered by the OPSCHALER project from 2017-02 to 2017-12. It contained data concerning gas and electricity consumption, obtained from the smart meters of 52 different dwellings; everything was anonymized and therefore the dwelling locations were unknown. The data acquisition period varied per dwelling from one to nine months. The electricity data was sampled at a ten second resolution, whereas the gas consumption was sampled at a one hour resolution. These sampling periods are visualized in Figure 1. Due to an uneven and scarce distribution of the data, the mean of all dwellings at each timestamp was taken, along with its standard deviation. This represented the mean gas use at the aggregated level, e.g. a block of 52 dwellings.
In addition to this, weather information obtained from the Royal Netherlands Meteorological Institute (KNMI) station in Rotterdam was used. This was the weather station closest to all dwellings; however, the distance to the most distant dwelling was 103 km. The weather information was sampled at a 15 minute resolution.
One of the goals of this research was to use as few features as possible, therefore a selection of the available parameters was made. Parameters refer to all the available variables from the original dataset, whereas features refer to the variables that were selected to forecast the target 'gasUse'. According to [12,13], air temperature and calendar related features are the most relevant when predicting gas consumption in the residential sector. In this research the target was 'gasUse', the gas consumption over an interval of one hour, measured in m³. As a baseline, the parameter with the highest Pearson correlation coefficient with 'gasUse' was selected as a feature; in this case this parameter was the temperature. In addition, the parameters whose Pearson correlation coefficient with the temperature satisfied |r| < 0.1 were selected. Setting this threshold to |r| < 0.1 minimized the influence of the features on each other. All the used parameters are listed in Table 1. The weather data was sampled at a 15 minute resolution, whereas 'gasUse' was sampled at a one hour resolution. To combine these two datasets into one, the weather data was downsampled to one hour by taking the mean and joined with 'gasUse'. Not a Number (NaN) values appeared in the time intervals where the data acquisition system stopped gathering information; this was the case for both the features and the target. All NaNs were removed from the dataset.
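A minimal sketch of this merging step using pandas is shown below; the file names and column labels are assumptions for illustration, since the raw OPSCHALER and KNMI exports are not described in detail here.

```python
import pandas as pd

# Hypothetical file and column names, for illustration only.
weather = pd.read_csv("knmi_rotterdam.csv", index_col="timestamp", parse_dates=True)
gas = pd.read_csv("mean_gas_use.csv", index_col="timestamp", parse_dates=True)

# Down-sample the 15 minute weather data to the one hour resolution of 'gasUse'
# by taking the mean of each hour.
weather_hourly = weather.resample("1H").mean()

# Combine both sources on the shared hourly index and drop the NaN rows that
# appear where the data acquisition system stopped logging.
df = weather_hourly.join(gas["gasUse"], how="inner").dropna()
```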

Train, test & validation dataset
All models used the same train, validation and test dataset. A visualisation of the distribution of the train, validation and test dataset is shown in Figure 2. The first 70 % of the dataset was used as the train set. The test set containing the remaining 30 % was used for cross-validation of each model. During training of the neural networks, the last 20 % of the train set was taken as the validation set.
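A chronological split along these lines could look as follows; this is a sketch reusing the merged dataframe `df` from the previous example.

```python
# First 70 % of the (chronologically ordered) data for training,
# the remaining 30 % for testing.
n = len(df)
train_full = df.iloc[: int(0.7 * n)]
test = df.iloc[int(0.7 * n):]

# The last 20 % of the training data is held out as the validation set
# while training the neural networks.
split = int(0.8 * len(train_full))
train, val = train_full.iloc[:split], train_full.iloc[split:]
```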

Model evaluation
Neural networks use an optimizer to minimize the loss function. This can be compared to the least squares method for linear regression, which minimizes the squared residuals. The loss function used for the neural network models in this paper was the Mean Squared Error (MSE), defined as:

$$\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2 \qquad (1)$$

Where $y_i$ is the $i$-th ground truth value, $\hat{y}_i$ is the $i$-th predicted value and $n$ is the total number of samples.
Two positive properties of the MSE are that it is, in general, relatively computationally inexpensive and that it is sensitive to outliers, because the difference between $y_i$ and $\hat{y}_i$ is squared. This sensitivity to outliers had a positive influence on the used dataset, because the outliers in the gas consumption represent valid data points, e.g. they were not corrupt data caused by malfunctioning measurement devices.
Two other available loss functions are the Mean Absolute Percentage Error (MAPE) and the Symmetric Mean Absolute Percentage Error (SMAPE). MSE was chosen over MAPE and SMAPE because the latter two are less sensitive to outliers. MAPE is defined as:

$$\mathrm{MAPE} = \frac{100\,\%}{n}\sum_{i=1}^{n}\left|\frac{y_i - \hat{y}_i}{y_i}\right| \qquad (2)$$

and SMAPE as:

$$\mathrm{SMAPE} = \frac{100\,\%}{n}\sum_{i=1}^{n}\frac{\left|\hat{y}_i - y_i\right|}{\left(\left|y_i\right| + \left|\hat{y}_i\right|\right)/2} \qquad (3)$$

Notice that the MSE is scale dependent, whereas MAPE and SMAPE are expressed in percentages. MAPE and SMAPE were used as evaluation metrics, together with MSE, to determine the performance of each model.
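The three metrics follow directly from equations (1) to (3); a small NumPy sketch:

```python
import numpy as np

def mse(y_true, y_pred):
    # Equation (1): mean of the squared residuals.
    return np.mean((y_true - y_pred) ** 2)

def mape(y_true, y_pred):
    # Equation (2): mean absolute error relative to the ground truth, in %.
    return 100.0 * np.mean(np.abs((y_true - y_pred) / y_true))

def smape(y_true, y_pred):
    # Equation (3): symmetric variant with an averaged denominator.
    return 100.0 * np.mean(np.abs(y_pred - y_true) / ((np.abs(y_true) + np.abs(y_pred)) / 2.0))
```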

Optimizers and learning rate scheduler
As stated in chapter 2.3, neural networks use an optimizer to minimize the loss function. In this research Adam [15] and Nadam [16] were used. The optimizer used is specified per model in its respective subchapter. The main difference between Adam and Nadam is that Adam is essentially RMSprop with momentum, whereas Nadam is essentially RMSprop with Nesterov momentum. This can be simplified by imagining a curved plane in ℝ³: when trying to reach the lowest point of this plane, Nadam will jump over hills more quickly than Adam would by default. Adam and Nadam were chosen instead of the commonly used Stochastic Gradient Descent (SGD) method, because they converge more quickly than SGD does [17].
To improve the rate of convergence of the loss function, a cosine annealing learning rate scheduler with periodic restarts was applied to the optimizer. Together with improving the rate of convergence, this also gives the ability to find a lower and wider minimum [18].
Regularly, without the learning rate scheduler, the learning rate is set to a fixed value for each batch within the total number of epochs. With the learning rate scheduler, the learning rate $\eta_t$ is instead changed per batch within the $i$-th run as follows [19]:

$$\eta_t = \eta_{min}^{i} + \frac{1}{2}\left(\eta_{max}^{i} - \eta_{min}^{i}\right)\left(1 + \cos\left(\frac{T_{cur}}{T_i}\pi\right)\right)$$

Where $\eta_{min}^{i}$ and $\eta_{max}^{i}$ are the ranges for the learning rate, $T_{cur}$ is the number of epochs since the last restart and $T_i$ is the length of the current run.
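A small sketch of this schedule following the formula above; the learning rate range and cycle length in the example are illustrative, not the exact values used in this research.

```python
import math

def cosine_annealing_lr(eta_min, eta_max, t_cur, t_i):
    """Learning rate after t_cur epochs (or batches) since the last restart,
    for a restart cycle of length t_i."""
    return eta_min + 0.5 * (eta_max - eta_min) * (1.0 + math.cos(math.pi * t_cur / t_i))

# Example: learning rate halfway through a 10-epoch cycle.
lr = cosine_annealing_lr(eta_min=1e-5, eta_max=1e-3, t_cur=5, t_i=10)
```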

Feature and batch normalization
The features were standardized by removing the mean and scaling to unit variance using the 'StandardScaler' function from scikit-learn. This prevents one or more features from dominating the others. Without scaling, the models also converge more slowly and are likely to reach a lower accuracy. Each feature from the training dataset was scaled independently. The standard score $z$ of a feature sample $x$ was calculated as [20]:

$$z = \frac{x - \mu}{\sigma}$$

Where $\mu$ is the mean value of the sample and $\sigma$ is the standard deviation of the sample. The fitted scaling was stored and also used to transform the features of the test dataset; it was not recalculated for the test set. This prevents data leakage of the feature distribution from the test set into the train set.
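In scikit-learn this corresponds to fitting the scaler on the training features only and reusing it for the test features, roughly as follows (the feature array names are assumptions):

```python
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
# Fit (i.e. compute mean and standard deviation) on the training features only.
X_train_scaled = scaler.fit_transform(X_train)
# Reuse the stored mean/std for the test features to avoid data leakage.
X_test_scaled = scaler.transform(X_test)
```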

One hot encoding
The hour of the day, day of the week and current season were extracted as features from the timestamp of each row in the dataset, where the hour of the day ranged from 0 to 23, the day of the week from 0 to 6 and the season from 1 to 4. These features were one hot encoded. This transforms the data so that each value of a feature gets a separate column in the dataset, for example:

$$\begin{bmatrix} 0 \\ 1 \\ 2 \end{bmatrix} \rightarrow \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix}$$

Where the day of the week is represented as a number in the column vector. After transforming this column vector into the 3 × 3 matrix, each column represents a day of the week, with a 1 indicating that the given row falls on that day: column one represents a Monday, column two a Tuesday and column three a Wednesday. Representing the features extracted from the timestamp in this way allows the models to assign different weights to, for example, 07:00 AM on a Monday and 09:00 AM on a Saturday.
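With a pandas DatetimeIndex this extraction and one hot encoding can be sketched as follows; the season mapping shown is one possible choice, not necessarily the one used in this research.

```python
import pandas as pd

calendar = pd.DataFrame({
    "hour": df.index.hour,                     # 0-23
    "dayofweek": df.index.dayofweek,           # 0 (Monday) - 6 (Sunday)
    "season": (df.index.month % 12 // 3) + 1,  # 1-4; winter = 1, spring = 2, ...
}, index=df.index)

# One column per value of each calendar feature.
calendar_onehot = pd.get_dummies(calendar, columns=["hour", "dayofweek", "season"])
df = df.join(calendar_onehot)
```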

Architecture evaluations
Hyperas was used to evaluate different architectures of each neural network model. A commonly used distribution of nodes and layers was set as the available parameter space. Hyperas was set to evaluate a fixed number of possibilities from the parameter space; this number was chosen such that the total evaluation time per model took 24 hours. During each evaluation Hyperas trained a different architecture for a specific number of epochs. In the end, the best performing architecture was used for the models in this paper. The number of evaluations per model is given in Table 2.
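A stripped-down Hyperas sketch is shown below to illustrate the mechanism; the search space, data handling and model body are placeholders and do not reflect the exact configuration that was evaluated.

```python
from hyperopt import Trials, tpe
from hyperas import optim
from hyperas.distributions import choice

def data():
    import numpy as np
    # Placeholder arrays; in the actual pipeline these would be the scaled
    # feature and target splits described earlier.
    X_train, y_train = np.random.rand(200, 5), np.random.rand(200)
    X_val, y_val = np.random.rand(50, 5), np.random.rand(50)
    return X_train, y_train, X_val, y_val

def create_model(X_train, y_train, X_val, y_val):
    from tensorflow.keras import Sequential, layers
    model = Sequential([
        layers.Dense({{choice([32, 64, 128])}}, activation="relu",
                     input_shape=(X_train.shape[1],)),
        layers.Dense(1),
    ])
    model.compile(optimizer="nadam", loss="mse")
    model.fit(X_train, y_train, validation_data=(X_val, y_val), epochs=10, verbose=0)
    loss = model.evaluate(X_val, y_val, verbose=0)
    return {"loss": loss, "status": "ok", "model": model}

# Evaluate a fixed number of architectures from the parameter space.
best_run, best_model = optim.minimize(model=create_model, data=data,
                                      algo=tpe.suggest, max_evals=50,
                                      trials=Trials())
```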

Initial neural network architecture setup
All neural network models were programmed in Python using the Keras library with a TensorFlow backend and were trained on an NVIDIA GeForce 960M GPU. The weights of each layer were initialized with a truncated normal distribution. The bias of each dense layer, apart from the output layer, was turned off, because each of these layers was followed by a batch normalization layer. This batch normalization layer normalized the layer outputs in a way similar to the feature scaling described in chapter 2.5 and was also applied after recurrent and convolutional layers. Finally, each layer apart from the output layer was followed by the Leaky version of a Rectified Linear Unit (LeakyReLU) activation function, defined as [21]:

$$h^{(i)} = \begin{cases} w^{(i)\top}x & \text{if } w^{(i)\top}x > 0 \\ 0.01\, w^{(i)\top}x & \text{otherwise} \end{cases}$$

Where $w^{(i)}$ is the weight vector for the $i$-th hidden node and $x$ is the node input.
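In Keras, one such hidden block (dense layer without bias, truncated normal initialisation, batch normalization and LeakyReLU) can be expressed roughly as follows; this is a sketch, not the exact layer definitions used.

```python
from tensorflow.keras import layers, initializers

def hidden_block(units):
    """One hidden block as described above; the output layer of each model
    omits the batch normalization and LeakyReLU and keeps its bias."""
    return [
        layers.Dense(units, use_bias=False,
                     kernel_initializer=initializers.TruncatedNormal()),
        layers.BatchNormalization(),
        layers.LeakyReLU(),
    ]
```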

MVLR
MVLR was the simplest model used in this research.
Despite being a simple model, it has often been used for energy forecasting and is known to make relatively accurate predictions [22,8]. The combination of performance and simplicity made the evaluation metrics of the MVLR the baseline against which the results of the other models were compared.
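The paper does not state which implementation was used; one straightforward way to fit such a baseline is scikit-learn's LinearRegression, reusing the scaled feature arrays from the standardization sketch.

```python
from sklearn.linear_model import LinearRegression

# Ordinary least squares on all selected (scaled) features at once.
mvlr = LinearRegression()
mvlr.fit(X_train_scaled, y_train)
y_pred_mvlr = mvlr.predict(X_test_scaled)
```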

DNN
The next model used was a feed-forward DNN, one of the most basic types of neural networks, as all connections go in one direction without cycles or loops. The data from the input layer is passed on to the next layers of nodes (hidden layers), which, based on their weights, offsets and activation functions, compute a value per node that is passed on to the next layer until the output node is reached. The output node then produces the prediction $\hat{y}$. One of the main benefits of a feed-forward DNN is its ability to adapt to non-linear relationships, in contrast to MVLR. Figure 3 contains a schematic of the used DNN architecture. The input of this model was of shape (features), containing the features of the current hour, used to predict the next target value of shape (target). With five features, (features) would be (5), e.g. a five-dimensional row vector. Simplified, the weather information of the current hour was used to predict the gas consumption of the next hour. Nadam was used as the optimizer while training this model.
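A minimal Keras sketch of such a feed-forward network is given below; the number and width of the hidden layers are illustrative, as the actual architecture was selected by Hyperas, and the scaled feature array from the earlier sketches is assumed.

```python
from tensorflow.keras import Sequential, layers

n_features = X_train_scaled.shape[1]

dnn = Sequential([
    layers.Input(shape=(n_features,)),  # features of the current hour
    layers.Dense(64, use_bias=False),
    layers.BatchNormalization(),
    layers.LeakyReLU(),
    layers.Dense(64, use_bias=False),
    layers.BatchNormalization(),
    layers.LeakyReLU(),
    layers.Dense(1),                     # predicted 'gasUse' of the next hour
])
dnn.compile(optimizer="nadam", loss="mse")
```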

LSTM and GRU
LSTM and GRU are based on the Recurrent Neural Network (RNN), which is often used for natural language and text processing [23]. LSTM networks differ from plain RNNs in their ability to store previously processed information in internal memory units, which can be an advantage when working with time series data [6]. Compared to LSTM, GRU networks use fewer parameters per node and can thus be interpreted as a simplified version of the LSTM model [24].
The input of these models was a matrix of shape (timesteps, features), containing the features of all timesteps, used to predict the next target value of shape (target). Timesteps can be interpreted as the number of hours the model is able to look back; for both the LSTM and GRU models this was set to 120 hours. Summarising, the weather information of the past 120 hours was used to predict the gas consumption of the next hour. The architectures used can be seen in Figure 4. Adam was used as the optimizer during training of the LSTM model, whereas Nadam was used for GRU. A layer configuration repeated $n_i$ times in sequence is represented by $\times\, n_i$, where $i$ is the layer number. For LSTM, $n_2, \ldots, n_6$ are equal to 0, 1, 3, 2, 1 and for GRU they are equal to 1, 1, 0, 4, 1, respectively.
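The (timesteps, features) input can be built by sliding a 120 hour window over the scaled features; a sketch assuming the array names from the earlier examples and an illustrative LSTM width:

```python
import numpy as np
from tensorflow.keras import Sequential, layers

def make_windows(X, y, timesteps=120):
    """Stack sliding windows so that each target is predicted from the
    preceding `timesteps` hours of features."""
    Xw = np.stack([X[i - timesteps:i] for i in range(timesteps, len(X))])
    yw = np.asarray(y[timesteps:])
    return Xw, yw

X_seq, y_seq = make_windows(X_train_scaled, y_train)  # shape (samples, 120, features)

lstm = Sequential([
    layers.Input(shape=X_seq.shape[1:]),
    layers.LSTM(64),    # replace with layers.GRU(64) for the GRU variant
    layers.Dense(1),
])
lstm.compile(optimizer="adam", loss="mse")  # Nadam was used for the GRU model
```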

CNN
A CNN is a type of deep neural network most commonly used for image recognition. Partly due to the development of autonomous cars, image classification, facial recognition and more, CNNs are among the most advanced neural networks currently being developed in computer science [25,26,27]. Compared to the previously discussed networks, applying a CNN required the addition of a channel dimension to the feature matrix used in the RNN models. This changed the shape from [timesteps, features] to [height, width, channel], where channel is the colour dimension: three for RGB images and one for grey-scale images. Timesteps and features were interpreted as the height and width of the input image. The architecture used can be seen in Figure 5. The model was trained with the Nadam optimizer.
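The only structural change with respect to the recurrent models is the extra channel dimension; a rough Conv2D sketch, reusing `X_seq` from the previous example, with illustrative filter counts and kernel sizes:

```python
import numpy as np
from tensorflow.keras import Sequential, layers

# Treat each 120 hour window as a grey-scale image: (timesteps, features, 1).
X_img = X_seq[..., np.newaxis]

cnn = Sequential([
    layers.Input(shape=X_img.shape[1:]),
    layers.Conv2D(8, kernel_size=(8, 4), use_bias=False),
    layers.BatchNormalization(),
    layers.LeakyReLU(),
    layers.Flatten(),
    layers.Dense(1),
])
cnn.compile(optimizer="nadam", loss="mse")
```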

Time distributed CNN + RNN + DNN
The Time Distributed model consisted of a time distributed CNN, followed by an LSTM and finally a DNN. This model combined the strengths of all three model types. An image of shape (timesteps, columns) = (120, 39) was fed into the input layer. This image was then reshaped into 24 smaller images of shape (5, 39, 1), where 1 is the channel dimension required by the CNN.
Each smaller image was fed to the CNN and the flattened CNN output was stored in memory. These flattened outputs made up the sequence that was fed into the RNN, of shape (5, flattened CNN output). From here on, the RNN and DNN were applied as explained in their respective subchapters. Nadam was used as the optimizer to train this model.
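A rough sketch of this combination is given below, assuming a sequence of 5 sub-images of 24 hours by 39 columns (the exact reshaping should be checked against Figure 5) and illustrative layer widths. TimeDistributed applies the same CNN to every sub-image before the LSTM consumes the flattened outputs.

```python
from tensorflow.keras import Sequential, layers

model = Sequential([
    layers.Input(shape=(5, 24, 39, 1)),                                    # sequence of sub-images
    layers.TimeDistributed(layers.Conv2D(8, kernel_size=(2, 4), activation="relu")),
    layers.TimeDistributed(layers.Flatten()),                              # one vector per sub-image
    layers.LSTM(64),                                                       # consumes the sequence
    layers.Dense(64, activation="relu"),
    layers.Dense(1),                                                       # next-hour 'gasUse'
])
model.compile(optimizer="nadam", loss="mse")
```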

Figure 5. Architecture used, where α1 is equal to 5, α2 is equal to 8, (β1,1 × β1,2) equals (8 × 4) and (β2,1 × β2,2) equals (10 × 8). The final output of the CNN is flattened and fed into a DNN where ψ1 … ψ3 equal 64, 128, 256 and Ψ1 … Ψ3 are equal to 2, 2, 1, respectively.

Results
Table 2 shows that at a one hour resolution the DNN model performed best, with a 50.1 % MAPE, while LSTM had the lowest performance with a MAPE of 139 %. DNN outperformed MVLR because of its ability to adapt to non-linearities. Among the deep neural network models, DNN also performed best; given the 24 hour limit on the architecture evaluations and the number of epochs, its simplicity worked to its advantage. A probable reason for LSTM and GRU performing worse than expected is the presence of NaNs in the dataset: when NaNs are removed, the missing timestamps affect the periodicity of the data and could therefore influence the accuracy of these models. Furthermore, this could explain why one hot encoded features such as the hour of the day, day of the week and season lead to performance gains, as the LSTM and GRU models adapt to a pattern with a fixed periodicity between time steps.
When comparing the one day and one week resolutions, the results indicated that the MVLR model outperformed the other models. MVLR had a MAPE of 20.2 % and 17.0 %, whereas LSTM was the lowest performing model with a MAPE of 99.7 % and 95.0 % at the one day and one week resolution, respectively. When downsampling the data from one hour to lower resolutions, the cumulative error is reduced because the surface area of the errors becomes smaller. Figure 6 shows that the models tended to forecast systematically below the real values. Furthermore, during summer MVLR and DNN outperformed LSTM, GRU, CNN and Time Distributed. During winter, the difference between the real and the forecasted values became larger.
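The down-sampling to daily and weekly totals can be done with a plain resample-and-sum before recomputing the metrics; a sketch with assumed variable names:

```python
import pandas as pd

# Hourly predictions and ground truth, indexed by timestamp (names assumed).
pred_hourly = pd.Series(y_pred, index=test.index, name="gasUse_pred")
true_hourly = test["gasUse"]

# Sum to one day and one week resolutions before evaluating the metrics.
pred_daily, true_daily = pred_hourly.resample("D").sum(), true_hourly.resample("D").sum()
pred_weekly, true_weekly = pred_hourly.resample("W").sum(), true_hourly.resample("W").sum()
```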

Conclusion
This paper compares several machine learning algorithms for forecasting the mean gas consumption of 52 dwellings, representing a block of dwellings at the aggregated level. Forecasts were made at an hourly resolution using the wind speed, rain intensity, temperature, season, hour of the day and day of the week as features. The feature selection used as few features as possible, with the objective of keeping the required computational power as low as possible. The choice of models was based on previous research studies, which were mainly focused on forecasting the gas and electricity consumption of commercial and office buildings. Furthermore, three types of deep learning models were combined into a single model called Time Distributed. Time Distributed combined the potential of CNNs and RNNs, with the goal of achieving better performance. More specifically, this was a time distributed CNN followed by an LSTM and a DNN.
To validate the applicability of each model, the models were compared on performance and computational time required. The gas consumption data of the mean of 52 dwellings was split into a training and test dataset of 70 % and 30 %, respectively. Predictions were cross validated on the test set with an hourly resolution. To evaluate the performance on multiple resolutions, the hourly predictions were down sampled to one day and one week resolution by summation.
As seen in Table 2, DNN performed best on an hourly resolution when looking at the MAPE. On a daily and weekly resolution, MVLR outperformed the other models. In all resolutions, LSTM had the lowest performance.
Further studies should focus on exploring ways to obtain more accurate results and on applying the models to individual dwellings. One way the evaluation metrics could be improved is by using more data, e.g. a sampling period of a full year or more. This is substantiated by the difference between the validation and train loss. Increasing the amount of training data could help with the recognition of human behaviour patterns and the dependency on outside weather conditions. Alongside this, more features, such as the electricity consumption, could be used to improve the accuracy of the deep learning models. However, this increases the computational power needed, which can be a drawback in certain situations, e.g. when the available hardware has insufficient computational power. In this research, as few features as possible were used. In addition to the previously stated recommendations, increasing the number of architecture evaluations, along with the number of epochs for the final model architecture, can lead to better performance.