Human Activity Prediction using Long Short Term Memory

Early symptoms of dementia are one of the causes of a decrease in quality of life. A human activity recognition (HAR) system is proposed to recognize daily routines, which plays an important role in detecting early symptoms of dementia. Long Short Term Memory (LSTM) is very useful for sequence analysis and can find the patterns of activities carried out in daily routines. However, the LSTM model is slow to converge and takes a long time to train. In this paper, we investigated the sequences of actions recorded in smart home sensor data using an LSTM model, and then optimized the model using several optimization methods: Stochastic Gradient Descent (SGD), Adagrad, Adadelta, RMSProp, and Adam. The results showed that optimizing the LSTM with Adam outperforms the other optimization methods.


Introduction
Human activity recognition (HAR) is one of the efforts that can be pursued in developing an intelligent environment for monitoring daily activities [1]. HAR plays an important role in human and interpersonal interaction, because it can provide a better understanding of a person's personality and psychological condition [2]. Generally, HAR is used to identify abnormal behaviour [3]. According to historical data, abnormal behaviour is defined as a decrease in physical activity in carrying out daily routines [4]. Abnormal behaviour refers to abnormal activity caused by cognitive decline, and it can manifest as early symptoms of dementia at 30-50 years old [5].
Early detection of dementia refers to finding patterns of activities that are incompatible with normal behaviour, such as the activity patterns of dementia patients. However, the availability of real-world data on dementia patients is a big challenge for non-medical diagnosis [6]. Early symptoms of dementia can be identified by monitoring behavioural variations such as disturbed sleep, difficulty in walking, and inability to complete an activity. Such variations can provide important information about a person's memory, mobility, and cognition, from which the patterns of activities carried out in daily life can be derived [7]. Research in the HAR area shows that modeling activities sequentially over time, i.e. time series modeling, can provide indicators for early detection of dementia [8].
Long Short Term Memory networks, usually just called LSTMs, are a special kind of RNN capable of learning long-term dependencies. They work very well on a large variety of problems and are now widely used in sequence analysis [9]. The author of [10] analyzed the sequences of actions recorded by sensors in a smart home environment, introducing an LSTM model that allows us to predict the user's next actions and to identify anomalous user behaviour. The research in [7,8] shows that experiments using LSTMs to model activities have a clear advantage in accuracy. However, LSTM training is slow to converge and takes a long time [11]. One way to overcome this issue is to apply optimization methods. Gradient-based algorithms use the gradient of the performance function to determine how to adjust the weights to minimize the error [12], with the aim of improving performance, accuracy, and training time [13].
In this paper we investigated activity recognition using an LSTM network model on a publicly available smart home dataset [14]. We used several optimization methods, namely SGD [15], Adagrad [16], Adadelta [17], RMSProp [18], and Adam [19], to optimize the LSTM model for time series analysis. Similar variations of optimization methods have been applied to optimize a neural network with MicroRNA features and showed very good results [20]. Therefore, in this study these optimization methods were used to optimize the LSTM model, and the results were compared to find the best optimizer for activity prediction.

Word Embedding
Many word representation techniques are used in machine learning and deep learning, and one-hot encoding is the most common. However, the disadvantage of this type of representation is that it is not possible to compute how similar two words are. The solution is to use word embeddings [21]. Word embeddings have the ability to generalize words that have semantic similarities, capture semantic meaning, and assign similar vectors to similar words, which is not the case with one-hot encoding. Hence, word embeddings can better model semantic complexity. In our model, we use the Word2Vec representation to define each action's embedding matrix.
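The contrast above can be illustrated with a small sketch: with one-hot vectors every pair of distinct actions has cosine similarity 0, while a dense embedding can place related actions close together. The action names and the toy 3-dimensional embedding matrix below are illustrative assumptions, not values from the paper (where the matrix would be learned by Word2Vec).

```python
import numpy as np

# Hypothetical vocabulary of sensor-labelled actions (illustrative only).
actions = ["sleep", "nap", "cook", "eat"]
idx = {a: i for i, a in enumerate(actions)}

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# One-hot vectors: "sleep" is no closer to "nap" than to "cook".
one_hot = np.eye(len(actions))
assert cosine(one_hot[idx["sleep"]], one_hot[idx["nap"]]) == 0.0

# Toy dense embedding (would be learned by Word2Vec in the paper's model);
# similar activities get nearby vectors.
emb = np.array([
    [0.9, 0.1, 0.0],   # sleep
    [0.8, 0.2, 0.1],   # nap
    [0.0, 0.9, 0.3],   # cook
    [0.1, 0.8, 0.4],   # eat
])
print(cosine(emb[idx["sleep"]], emb[idx["nap"]]))   # high similarity
print(cosine(emb[idx["sleep"]], emb[idx["cook"]]))  # low similarity
```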

Long Short Term Memory
Long Short Term Memory units, or LSTMs, proposed by [22], are a variant of the Recurrent Neural Network architecture. LSTMs have the advantage of solving sequential data problems, especially with time series data. An LSTM network has memory cell blocks. Each cell contains gates: an input gate, a neuron with a self-recurrent connection, a forget gate, and an output gate. The cell operates on the input sequence; each gate uses a sigmoid activation, whose output is between 0 and 1, indicating whether the information will be forwarded or stopped. In brief, the sigmoid activations control whether the gates are triggered, making the state change and the addition of information flowing through the cell conditional. The procedure of an LSTM network at time step $t$ is as follows [22]:
(a) Decide what information to throw away from the previous cell state $c_{t-1}$, using the forget gate $f_t$:
$$f_t = \sigma(W_f x_t + U_f h_{t-1} + b_f) \quad (1)$$
(b) Decide what new information from the input $x_t$ should be stored in the cell state $c_t$, by computing the input gate $i_t$ and the candidate memory $\tilde{c}_t$:
$$i_t = \sigma(W_i x_t + U_i h_{t-1} + b_i) \quad (2)$$
$$\tilde{c}_t = \tanh(W_c x_t + U_c h_{t-1} + b_c) \quad (3)$$
(c) Update the old cell state $c_{t-1}$ into the new cell state $c_t$, combining the candidate memory $\tilde{c}_t$ and the old cell state $c_{t-1}$:
$$c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t \quad (4)$$
(d) Decide what to output as $h_t$ via the output gate $o_t$, using the output information and the cell state $c_t$:
$$o_t = \sigma(W_o x_t + U_o h_{t-1} + b_o) \quad (5)$$
$$h_t = o_t \odot \tanh(c_t) \quad (6)$$
where $W$ is the input weight, $U$ is the recurrent weight, and $b$ is the bias. The subscripts $f$, $i$, $o$ denote the forget, input, and output gates. The term $x_t$ is the input to the memory cell layer at time step $t$, $h_{t-1}$ is the hidden state computed at the previous time step, and $h_t$ is the hidden state output at time step $t$. $\sigma$ is the sigmoid function, which squeezes values between 0 and 1, and $\tanh$ is the hyperbolic tangent activation function, which squeezes values between -1 and 1. The two activation functions are expressed as:
$$\sigma(x) = \frac{1}{1 + e^{-x}}, \qquad \tanh(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}}$$
The steps above constitute the feedforward training process. To perform the backpropagation steps, it is necessary to calculate the derivative of each gate at time step $t$ [22], starting from $\Delta h_t$, the derivative of the error with respect to the hidden state, which formally corresponds to $\partial E / \partial h_t$ but does not include the recurrent dependencies [22]; the gate derivatives computed at time step $t+1$ are then propagated back to time step $t$. In the last step, the gradients of the weights are accumulated over all time steps:
$$\delta W = \sum_t \delta g_t \otimes x_t, \qquad \delta U = \sum_t \delta g_{t+1} \otimes h_t, \qquad \delta b = \sum_t \delta g_{t+1}$$
where $\delta g_t$ is the concatenated gradient of all gates at time step $t$, $\delta W$ is the gradient of the input weight, $\delta U$ is the gradient of the recurrent weight, and $\delta b$ is the gradient of the bias.
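The forward pass described above can be sketched in a few lines of numpy. This is a minimal single-step illustration, not the paper's implementation: the four gates' parameters are stacked into one weight matrix per input, and the toy dimensions (input size 3, hidden size 2) are arbitrary.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    """One forward step of an LSTM cell. W, U, b hold the four gates'
    parameters stacked as [forget, input, candidate, output]."""
    n = h_prev.shape[0]
    z = W @ x_t + U @ h_prev + b          # all four pre-activations at once
    f = sigmoid(z[0:n])                   # forget gate
    i = sigmoid(z[n:2*n])                 # input gate
    c_tilde = np.tanh(z[2*n:3*n])         # candidate memory
    o = sigmoid(z[3*n:4*n])               # output gate
    c = f * c_prev + i * c_tilde          # new cell state
    h = o * np.tanh(c)                    # new hidden state
    return h, c

# Toy dimensions: input size 3, hidden size 2.
rng = np.random.default_rng(0)
W = rng.normal(size=(8, 3))
U = rng.normal(size=(8, 2))
b = np.zeros(8)
h, c = lstm_step(rng.normal(size=3), np.zeros(2), np.zeros(2), W, U, b)
print(h.shape, c.shape)  # (2,) (2,)
```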

Optimizer
An optimizer is an algorithm used to change the attributes or parameters of a neural network in order to reduce the error. In a neural network, the goal is to optimize the network by reducing the difference between the predicted output and the actual output. In this work, backpropagation through the LSTM is performed with SGD, Adagrad, Adadelta, RMSProp, and Adam, and the results obtained are compared.

Gradient Descent
Gradient descent is a first-order optimization algorithm and the most used optimization algorithm in neural networks. The algorithm works by updating the parameters to be optimized in the direction opposite to the gradient of the cost function. Stochastic gradient descent, or SGD, is a variant of gradient descent in which each iteration minimizes the error by considering the error of a single data point [24]. The iterations of SGD can be described as [15]:
$$x_{t+1} = x_t - \eta_t \nabla f(x_t)$$
where $x_t$ is the iterate, $\eta_t$ is the learning rate, and $\nabla f(x_t)$ denotes the stochastic gradient computed at $x_t$.
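As a minimal sketch of this update rule, the snippet below applies SGD to the toy objective $f(x) = x^2$ (gradient $2x$); the objective and learning rate are illustrative choices, not from the paper.

```python
def sgd_step(theta, grad, lr=0.1):
    # x_{t+1} = x_t - eta * gradient
    return theta - lr * grad

# Minimize f(x) = x^2, whose gradient is 2x, starting from x = 5.
x = 5.0
for _ in range(200):
    x = sgd_step(x, 2 * x)
print(x)  # close to 0
```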

Adagrad Optimizer
Adagrad is an algorithm for gradient-based optimization and a modification of the SGD algorithm with per-parameter learning rates, meaning it adapts the learning rate to the parameters. Adagrad can improve on the SGD algorithm because it increases the learning rate for sparse parameters and decreases the learning rate for ones that are less sparse [16]. The update rule for Adagrad is as follows:
$$\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{G_t} + \epsilon} \odot g_t$$
where $g_t$ is the gradient at time step $t$, $G_t$ is the per-parameter sum of squared past gradients, $\eta$ is the learning rate, and $\epsilon$ is a small smoothing term.
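A minimal sketch of the Adagrad update on a toy quadratic objective follows; the learning rate and objective are illustrative assumptions. Note how each parameter keeps its own accumulator `G`, so frequently-updated parameters get smaller steps.

```python
import numpy as np

def adagrad_step(theta, grad, G, lr=1.0, eps=1e-8):
    # Accumulate squared gradients per parameter, then scale the step:
    # theta <- theta - lr / (sqrt(G) + eps) * grad
    G = G + grad**2
    theta = theta - lr * grad / (np.sqrt(G) + eps)
    return theta, G

# Minimize f(theta) = sum(theta^2), gradient 2*theta.
theta = np.array([5.0, -3.0])
G = np.zeros_like(theta)
for _ in range(500):
    theta, G = adagrad_step(theta, 2 * theta, G)
print(theta)  # both coordinates close to 0
```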

Adadelta Optimizer
The main idea of Adadelta comes from the Adagrad algorithm: it seeks to reduce Adagrad's aggressive, continuously decreasing learning rate [17]. Adadelta is an extension of Adagrad that adapts learning rates based on a moving window of gradients instead of accumulating the sum of squared gradients over time [17]. Compared to Adagrad, Adadelta does not require an initial learning rate to be set. The steps of Adadelta are as follows:
1. Calculate the gradient $g_t$ and accumulate the squared gradients:
$$E[g^2]_t = \rho E[g^2]_{t-1} + (1-\rho) g_t^2$$
2. Compute the resulting parameter update:
$$\Delta\theta_t = - \frac{\sqrt{E[\Delta\theta^2]_{t-1} + \epsilon}}{\sqrt{E[g^2]_t + \epsilon}}\, g_t$$
3. Accumulate the squared update values:
$$E[\Delta\theta^2]_t = \rho E[\Delta\theta^2]_{t-1} + (1-\rho) \Delta\theta_t^2$$
4. Apply the update:
$$\theta_{t+1} = \theta_t + \Delta\theta_t$$
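The four steps above can be sketched directly in numpy. The decay rate $\rho = 0.95$ and $\epsilon = 10^{-6}$ are the commonly used defaults from [17], and the quadratic objective is an illustrative assumption; note that no learning rate appears anywhere.

```python
import numpy as np

def adadelta_step(theta, grad, Eg2, Edx2, rho=0.95, eps=1e-6):
    # 1. Accumulate squared gradients in a decaying average.
    Eg2 = rho * Eg2 + (1 - rho) * grad**2
    # 2. Compute the update from the ratio of running RMS values.
    dx = -np.sqrt(Edx2 + eps) / np.sqrt(Eg2 + eps) * grad
    # 3. Accumulate squared updates.
    Edx2 = rho * Edx2 + (1 - rho) * dx**2
    # 4. Apply the update.
    return theta + dx, Eg2, Edx2

# Minimize f(x) = x^2 (gradient 2x); no learning rate needs to be chosen.
theta, Eg2, Edx2 = 5.0, 0.0, 0.0
for _ in range(2000):
    theta, Eg2, Edx2 = adadelta_step(theta, 2 * theta, Eg2, Edx2)
print(theta)  # has moved toward 0 from the starting point 5.0
```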

RMSProp Optimizer
RMSProp is an adaptive learning rate method that has found much success in practice; it was proposed by [18]. RMSProp was developed to overcome Adagrad's issue of a learning rate that continuously decreases in each iteration. The steps of RMSProp are as follows:
$$E[g^2]_t = \gamma E[g^2]_{t-1} + (1-\gamma) g_t^2$$
$$\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{E[g^2]_t + \epsilon}}\, g_t$$
where the decay term $\gamma$ is set to 0.9 and the learning rate $\eta$ is 0.001.
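A minimal sketch of these two equations, using the decay 0.9 and learning rate 0.001 stated above; the quadratic objective is an illustrative assumption. Because the squared-gradient average decays, the effective step size does not shrink to zero the way Adagrad's does.

```python
import numpy as np

def rmsprop_step(theta, grad, Eg2, lr=0.001, gamma=0.9, eps=1e-8):
    # E[g^2]_t = 0.9 * E[g^2]_{t-1} + 0.1 * g_t^2
    Eg2 = gamma * Eg2 + (1 - gamma) * grad**2
    # theta <- theta - lr / sqrt(E[g^2]_t + eps) * grad
    return theta - lr * grad / np.sqrt(Eg2 + eps), Eg2

# Minimize f(x) = x^2 (gradient 2x) starting from x = 5.
theta, Eg2 = 5.0, 0.0
for _ in range(5000):
    theta, Eg2 = rmsprop_step(theta, 2 * theta, Eg2)
print(theta)  # oscillates very close to 0
```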

Adam Optimizer
Adam is an adaptive learning rate method which computes individual learning rates for different parameters. The name Adam is derived from adaptive moment estimation, because the algorithm uses both the first and second moments of the gradient to update the learning rate. Adam combines the advantages of two methods, Adagrad and RMSProp. The method first updates the exponential moving averages of the gradient $m_t$ and of the squared gradient $v_t$, which are the estimates of the first and second moment. The Adam optimization steps are as follows [19]:
1. Initialize the decay rates $\beta_1 = 0.9$, $\beta_2 = 0.99$, and $\epsilon = 10^{-8}$, and set the time step $t = 1$.
2. Calculate the gradient $g_t$ at time $t$.
3. Calculate the first moment: $m_t = \beta_1 m_{t-1} + (1-\beta_1)\, g_t$
4. Calculate the second moment: $v_t = \beta_2 v_{t-1} + (1-\beta_2)\, g_t^2$
5. Calculate the bias correction for the first moment: $\hat{m}_t = m_t / (1 - \beta_1^t)$
6. Calculate the bias correction for the second moment: $\hat{v}_t = v_t / (1 - \beta_2^t)$
7. Update the parameters: $\theta_t = \theta_{t-1} - \eta\, \hat{m}_t / (\sqrt{\hat{v}_t} + \epsilon)$
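The seven steps above translate almost line-for-line into code. The sketch below uses the standard defaults from [19] ($\beta_2 = 0.999$ rather than the 0.99 listed above) and an illustrative learning rate and quadratic objective; note that bias correction matters most in the first iterations, when $m_t$ and $v_t$ are still close to their zero initialization.

```python
import numpy as np

def adam_step(theta, grad, m, v, t, lr=0.01, b1=0.9, b2=0.999, eps=1e-8):
    m = b1 * m + (1 - b1) * grad          # first-moment estimate
    v = b2 * v + (1 - b2) * grad**2       # second-moment estimate
    m_hat = m / (1 - b1**t)               # bias-corrected first moment
    v_hat = v / (1 - b2**t)               # bias-corrected second moment
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

# Minimize f(x) = x^2 (gradient 2x) starting from x = 5.
theta, m, v = 5.0, 0.0, 0.0
for t in range(1, 2001):                  # t starts at 1 for bias correction
    theta, m, v = adam_step(theta, 2 * theta, m, v, t)
print(theta)  # close to 0
```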

Evaluation of Model Performance
To evaluate the performance of our model, we use top-k categorical accuracy. The formula for this accuracy is expressed as [10]:
$$\text{accuracy} = \frac{1}{N} \sum_{i=1}^{N} \mathbb{1}\!\left[a_i \in \hat{A}_i^{(k)}\right]$$
where $N$ is the number of queries, $a_i$ is the expected action, $\hat{A}_i^{(k)}$ is the set of top-k predicted actions, and $k$ is usually set from 1 to 5.
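This metric can be sketched as follows: a query counts as correct if the expected action appears among the k highest-scoring predictions. The 6-action vocabulary and score rows below are toy values for illustration, not results from the paper.

```python
import numpy as np

def top_k_accuracy(y_true, probs, k=5):
    """Fraction of queries whose expected action appears among the
    k highest-scoring predicted actions."""
    top_k = np.argsort(probs, axis=1)[:, -k:]          # indices of k best
    hits = [y in row for y, row in zip(y_true, top_k)]
    return float(np.mean(hits))

# Toy scores over a 6-action vocabulary for 3 queries.
probs = np.array([
    [0.10, 0.50, 0.20, 0.10, 0.05, 0.05],  # true action 1 -> top-1 hit
    [0.30, 0.10, 0.25, 0.20, 0.10, 0.05],  # true action 2 -> top-2 hit
    [0.40, 0.30, 0.10, 0.10, 0.05, 0.05],  # true action 5 -> miss at k=2
])
y_true = [1, 2, 5]
print(top_k_accuracy(y_true, probs, k=1))  # 1/3
print(top_k_accuracy(y_true, probs, k=2))  # 2/3
```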

Implementation
In order to create a time series model for activity prediction, we have used an LSTM network for the proposed prediction architecture, as shown in Figure 1. The process consists of data acquisition, word embedding representation, an LSTM model for activity prediction, and the use of several optimization methods during training to optimize the LSTM network, producing a prediction model that is then evaluated. In our proposed architecture, data acquisition was carried out using data already collected in previous studies, as explained in the next section. The data originate from activities monitored by sensors and are labelled as sequences of activities. These labels are then fed to the embedding layer and represented as Word2Vec vectors. The vectors are processed by the LSTM as the sequence modelling module, and the model is optimized with several optimization methods to help obtain good results. Finally, the results obtained are evaluated.
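The pipeline above (embedding layer feeding an LSTM, ending in a softmax over the action vocabulary) could be sketched in Keras as follows. All dimensions here (vocabulary size, sequence length, embedding width) are illustrative assumptions; only the overall layer ordering and the 512-unit LSTM with dropout 0.8 reflect the architecture described in this paper.

```python
# Minimal Keras sketch of the embedding -> LSTM -> softmax pipeline.
from tensorflow.keras import layers, models

VOCAB, SEQ_LEN, EMB_DIM = 50, 20, 64   # hypothetical dataset dimensions

model = models.Sequential([
    layers.Input(shape=(SEQ_LEN,)),            # sequence of action indices
    layers.Embedding(VOCAB, EMB_DIM),          # Word2Vec-style dense vectors
    layers.LSTM(512),                          # sequence modelling module
    layers.Dense(256, activation="relu"),      # hidden dense layer
    layers.Dropout(0.8),                       # dropout rate from the paper
    layers.Dense(VOCAB, activation="softmax"), # next-action distribution
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["sparse_top_k_categorical_accuracy"])
print(model.output_shape)  # (None, 50)
```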

Results and Discussion
In order to implement the prediction model, the LSTM algorithm was implemented using the Keras deep learning library. The experiment consists of 5 training experiments with several optimization methods, as shown in Table 1. The LSTM model was built with 1 LSTM layer of 512 neurons, 2 dense layers with a different activation function for each experiment and a dropout of 0.8, and a final dense layer with softmax activation. For each optimization experiment, the numbers of training epochs were 100, 300, and 500, and the learning rates were 0.001, 0.01, 0.05, and 0.1. The dataset was divided into a training set (80% of the whole dataset) and a validation set (20%). The training was performed on an Nvidia GeForce GTX 1050Ti. Tables 2-6 show the results for each optimization experiment. All experiments use the same LSTM architecture, the same learning rates, and the same numbers of epochs, as explained previously. In the first experiment (Exp-1), Table 2 shows that the Adagrad optimizer with the sigmoid activation function achieved the best accuracy at top-5, with a learning rate of 0.01 at 300 epochs.
In the second experiment (Exp-2), Table 3 shows that the Adadelta optimizer with the tanh activation function achieved the best accuracy at top-5, with a learning rate of 0.05 at 300 epochs.
In the third experiment (Exp-3), Table 4 shows that the SGD optimizer with the ReLU activation function achieved the best accuracy at top-5, with a learning rate of 0.01 at 500 epochs.
In the fourth experiment (Exp-4), Table 5 shows that the RMSProp optimizer with the ReLU activation function achieved the best accuracy at almost every top-k, with a learning rate of 0.1 at 300 epochs.
Finally, in the last experiment (Exp-5), Table 6, the best results were achieved using the Adam optimizer with the ReLU activation function, with values of 80.34% at top-4 and 86.32% at top-5, at a learning rate of 0.1 and 500 epochs. Based on these results, Adam has the highest accuracy. However, the accuracy can be influenced by several factors, such as the data used, the quality of preprocessing, the architecture, parameter tuning, etc.

Conclusion
In this paper, we presented a study of human activity prediction using an LSTM as the prediction model. The proposed methodology consists of five steps: data acquisition, word embedding representation, the LSTM layer, optimization, and model evaluation. The results showed that using Adam with the ReLU activation function to optimize the LSTM model works well and produced the best accuracy compared to the other methods. Adam achieved the best accuracy of 86.32% at top-5, with a learning rate of 0.1 at 500 epochs. There is still potential to improve the accuracy. Combining Adam and SGD by switching from Adam to SGD in the middle of training to optimize the LSTM is a promising direction for future work, because Adam shows huge performance gains in terms of training speed, and switching to SGD partway through training has been shown to yield better generalization than using Adam alone [15].