Machine Learning in Stock Price Forecast

This paper analyzes and compares the forecasting performance of three machine learning algorithms (multiple linear regression, random forest, and an LSTM network) for stock price prediction, using the closing price data of a NASDAQ ETF and data on statistical factors. The test results show that forecasts based on closing price data are better than those based on statistical factors, although the difference is not significant. Multiple linear regression is the most suitable algorithm for stock price forecasting. Random forest ranks second and is prone to overfitting. The LSTM network performs worst, with the highest values of RMSE and MAPE.


Introduction
Machine learning [1] is the general term for a class of algorithms that attempt to extract hidden information from historical data and solve prediction or classification problems. At present, research on machine learning in stock prediction mainly focuses on two key aspects. One is that researchers improve an original algorithm and try to show that the improved algorithm predicts future stock prices better than the original one [2,3]. The other is that researchers analyze and verify which algorithm is most suitable for forecasting future stock prices based on a comparison of several machine learning algorithms [4,5,6,7,8,9]. The most commonly used algorithms in stock prediction are multiple linear regression [10], neural network algorithms such as Long Short-Term Memory [11], support vector machines [12], and random forest [13,14]. The remaining algorithms [15] are usually used as complementary algorithms to improve and optimize the above ones.

Multiple Linear Regression
Multiple linear regression is an algorithm that aims to estimate the parameters $W$ and $b$. The specific steps are as follows. First, we design a multiple linear regression model, $y = Wx + b$. Second, we construct the least-squares objective function,
$$(W^*, b^*) = \arg\min_{W,b} \sum_i (y_i - Wx_i - b)^2.$$
Finally, the gradient descent method is used to minimize the objective function, and the model parameters $W$ and $b$ are obtained. Once the model is trained, it can be used to make predictions.
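The steps above can be sketched in NumPy. This is a minimal illustration, not the paper's implementation: the synthetic data, learning rate, and iteration count are assumptions chosen for clarity.

```python
import numpy as np

# Minimal sketch: fit y = W x + b by minimizing the least-squares objective
# with batch gradient descent. Data here is synthetic and noise-free.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))                 # 200 samples, 3 features
true_w, true_b = np.array([1.5, -2.0, 0.5]), 0.3
y = X @ true_w + true_b

w = np.zeros(3)
b = 0.0
lr = 0.1                                      # illustrative learning rate
for _ in range(500):
    err = (X @ w + b) - y                     # residuals of current model
    w -= lr * 2 * X.T @ err / len(y)          # gradient of MSE w.r.t. w
    b -= lr * 2 * err.mean()                  # gradient of MSE w.r.t. b

# w and b converge toward true_w and true_b on this noise-free data.
```

Because the objective is convex, gradient descent recovers the least-squares solution; on noisy real data the recovered parameters would instead be the best linear fit.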

Random Forest
The random forest algorithm can solve both classification and regression problems. In general, random forest uses bootstrapping to construct N training sets from the sample data and trains an individual classifier on each of them. For classification problems, the final result is decided by a vote among the individual classifiers; for regression problems, the final result is the mean of the individual classifiers' outputs. Random uniform sampling with replacement is used in the random forest algorithm, the weights of the individual classifiers are equal, and each individual classifier can be generated in parallel. We choose the Classification And Regression Tree (CART) as the individual classifier algorithm in this paper.
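The bootstrap-and-average procedure for regression can be illustrated with a minimal NumPy sketch. Note the assumptions: a simple polynomial regressor stands in for the CART base learner the paper uses, and the data and ensemble size are purely illustrative.

```python
import numpy as np

# Minimal bagging sketch (the ensemble idea behind random forest).
rng = np.random.default_rng(1)
x = np.linspace(0, 1, 100)
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.1, size=x.size)  # noisy target

n_estimators = 25
models = []
for _ in range(n_estimators):
    idx = rng.integers(0, x.size, x.size)     # bootstrap sample (with replacement)
    models.append(np.polyfit(x[idx], y[idx], deg=5))  # stand-in base learner

# Regression: the final output is the mean of the individual predictions,
# each model carrying equal weight.
preds = np.mean([np.polyval(c, x) for c in models], axis=0)
rmse = np.sqrt(np.mean((preds - np.sin(2 * np.pi * x)) ** 2))
```

Averaging over bootstrap replicas reduces the variance of the individual learners, which is the mechanism random forest relies on.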

LSTM
LSTM (Long Short-Term Memory) is a type of recurrent neural network (RNN) that can mitigate the problem of long-term dependencies and apply previous information to the current task.
(1) The related concepts of LSTM
1) The cell state value $C_t$ is the cell state at time step $t$. The transition from state $C_{t-1}$ to state $C_t$ is called the $(t-1) \to t$ process.
2) The output of the sigmoid function $\sigma$ is a vector whose elements are numbers between 0 and 1. These elements act as weights that decide how much of the input information is retained: 0 means "don't let any information pass" and 1 means "let all information pass".
3) The output of the $\tanh$ function is a vector whose elements are numbers between -1 and 1. However, $\tanh$ is different from $\sigma$: it maps the input information to values between -1 and 1, changing only the form of the input information while retaining its content.
(2) The mathematical expression of LSTM
First, four variables, the forget gate $f_t$, the input gate $i_t$, the output gate $o_t$, and the candidate cell state $\tilde{C}_t$, are calculated in the LSTM network. Here $h_{t-1}$ represents the output of the $(t-1)$ process, $x_t$ represents the input of the $t$ process, and $W$ and $b$ represent the weights and biases:
$$f_t = \sigma(W_f \cdot [h_{t-1}, x_t] + b_f)$$
$$i_t = \sigma(W_i \cdot [h_{t-1}, x_t] + b_i)$$
$$\tilde{C}_t = \tanh(W_C \cdot [h_{t-1}, x_t] + b_C)$$
$$o_t = \sigma(W_o \cdot [h_{t-1}, x_t] + b_o)$$
Second, the cell state value $C_t$ is calculated; $f_t$ is the weight that decides how much of $C_{t-1}$ is retained:
$$C_t = f_t \odot C_{t-1} + i_t \odot \tilde{C}_t$$
Finally, the mathematical expression of the LSTM network's output is obtained:
$$h_t = o_t \odot \tanh(C_t)$$
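A single forward step of the standard LSTM gate equations can be sketched in NumPy. The dimensions, stacked weight layout, and initialization below are illustrative assumptions, not the paper's configuration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM step. W has shape (4*H, H+D): the stacked weights of the
    forget, input, candidate and output gates; b has shape (4*H,)."""
    H = h_prev.size
    z = W @ np.concatenate([h_prev, x_t]) + b
    f = sigmoid(z[0 * H:1 * H])        # forget gate, elements in (0, 1)
    i = sigmoid(z[1 * H:2 * H])        # input gate, elements in (0, 1)
    c_tilde = np.tanh(z[2 * H:3 * H])  # candidate state, elements in (-1, 1)
    o = sigmoid(z[3 * H:4 * H])        # output gate, elements in (0, 1)
    c = f * c_prev + i * c_tilde       # new cell state C_t
    h = o * np.tanh(c)                 # new output h_t
    return h, c

rng = np.random.default_rng(2)
D, H = 3, 4                            # illustrative input / hidden sizes
W = rng.normal(scale=0.1, size=(4 * H, H + D))
b = np.zeros(4 * H)
h, c = lstm_step(rng.normal(size=D), np.zeros(H), np.zeros(H), W, b)
```

Since $h_t = o_t \odot \tanh(C_t)$ with $o_t \in (0,1)$, every element of the output lies strictly between -1 and 1.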

Standardized method of input data
In this paper, we use the Z-score statistical method to transform the original data $X = (x_1, x_2, \cdots, x_n)$. The formula can be expressed as:
$$z_i = \frac{x_i - \mu}{\sigma}$$
where $\mu$ is the mean and $\sigma$ is the standard deviation of $X$. After transformation, the mean value of the data is 0 and the variance is 1.
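The transformation above amounts to one line of NumPy; the sample values are illustrative.

```python
import numpy as np

# Z-score standardization: subtract the mean, divide by the standard
# deviation (population form, ddof=0).
x = np.array([10.0, 12.0, 14.0, 16.0, 18.0])
z = (x - x.mean()) / x.std()

# The standardized series has mean 0 and variance 1.
```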

The types of input data
There are two types of input data in this paper, which are as follows.
First, the first type of data is the standardized closing price of the NASDAQ ETF. This type of data can be described as $D_{close} = (X_{close}, Y_{close})$. $X_{close} = (x_1, \cdots, x_t, \cdots)$, where $x_t = (p_{t-1}, \cdots, p_{t-m})$ represents the input data of the $t$-th day, $p_{t-i}$ represents the closing price of the $(t-i)$-th day, and $m = 10$. $Y_{close} = (y_1, \cdots, y_t, \cdots)$, where $y_t$ represents the output value of the $t$-th day. The total number of data points is 776, covering the period from June 1, 2016 to June 28, 2019. The training data account for 80 percent of the total, i.e., 620 points, covering the period from June 1, 2016 to November 12, 2018. The test data are divided into in-sample and out-of-sample data. The in-sample data are selected randomly from the training data and account for 20 percent of the total. The out-of-sample data are the remaining data after the training data are removed from the total, also accounting for 20 percent of the total. The out-of-sample period is from November 13, 2018 to June 28, 2019, with 156 data points.
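Building this first input type is a sliding-window construction, sketched below. A random placeholder series stands in for the standardized NASDAQ ETF closing prices, and the chronological 80/20 split is a simplification of the paper's three-way split.

```python
import numpy as np

# Placeholder for the standardized closing price series (776 points in the paper).
rng = np.random.default_rng(3)
prices = rng.normal(size=776)

m = 10                                             # lag window length
# Input for day t: the previous m closing prices; label: day t's price.
X = np.stack([prices[t - m:t] for t in range(m, prices.size)])
y = prices[m:]

split = int(0.8 * X.shape[0])                      # chronological 80/20 split
X_train, y_train = X[:split], y[:split]
X_test, y_test = X[split:], y[split:]
```

The first $m$ days cannot be labeled, so a series of length $n$ yields $n - m$ input/output pairs.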
Second, the second type of data is the standardized statistical factors constructed from the closing price, opening price, maximum price, minimum price, and trading volume of the NASDAQ ETF. This type of data can be described as $D_{factor} = (X_{factor}, Y_{factor})$. $X_{factor} = (X_1, \cdots, X_t, \cdots)$, where $X_t = (x_{t1}, x_{t2}, \cdots, x_{tj}, \cdots, x_{tk})$ represents the input data of the $t$-th day, $x_{tj}$ represents the value of the $j$-th statistical factor on the $t$-th day, and $k = 55$. $Y_{factor} = (y_1, \cdots, y_t, \cdots)$, where $y_t$ represents the output value of the $t$-th day. The total number of statistical-factor data points is 20,075, covering the period from January 17, 2018 to June 28, 2019. The training data account for 80% of the total, i.e., 16,115 points, covering the period from January 17, 2018 to March 18, 2019. The test data are divided into in-sample and out-of-sample data. The in-sample data are selected randomly from the training data and account for 20 percent of the total. The out-of-sample data are the remaining data after the training data are removed from the total, also accounting for 20 percent of the total. The out-of-sample period is from March 19, 2019 to June 28, 2019, with 3,960 data points.

Comparison of forecasting algorithms

Training model
By training the three classifiers (multiple linear regression, LSTM networks, random forest) on the training set $D_{train}$, we can obtain stable parameters for the three models. The detailed steps are as follows. First, we process the original training data $X_{train}$ with the classifier and obtain the prediction value $\hat{Y}_{train}$. Second, we compare the prediction value $\hat{Y}_{train}$ with the label value $Y_{train}$ and update the parameters through repeated iteration.

Assessment model
We use two evaluation indicators, RMSE and MAPE, to analyze the forecast performance of the three algorithms and determine which algorithm has the better forecast effect. The two indicators are constructed as follows:
$$RMSE = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(\hat{y}_i - y_i)^2}$$
$$MAPE = \frac{1}{n}\sum_{i=1}^{n}\left|\frac{\hat{y}_i - y_i}{y_i}\right|$$
where $\hat{y}_i$ represents the predicted value of the $i$-th day and $y_i$ represents the true value of the $i$-th day. For both indicators, the lower the value, the better the prediction effect of the model.

When we forecast the stock price using the in-sample closing price data of the NASDAQ ETF, the results show that the RMSE and MAPE of multiple linear regression are 0.051 and 0.387, those of random forest are 0.036 and 0.144, and those of the LSTM networks are 1.420 and 2.585. Random forest therefore achieves the best in-sample forecast, with the lowest RMSE and MAPE, while the LSTM networks achieve the worst, with the highest values. The forecast effect of multiple linear regression lies between the two, with RMSE and MAPE values close to those of random forest.

When we forecast the stock price using the in-sample statistical factor data, the results show that the RMSE and MAPE of multiple linear regression are 0.095 and 0.735, those of random forest are 0.043 and 0.212, and those of the LSTM networks are 1.199 and 5.256. Again, random forest achieves the best in-sample forecast, the LSTM networks achieve the worst, and multiple linear regression lies between the two, with values close to those of random forest.
When we forecast the stock price using the out-of-sample statistical factor data, the results show that the RMSE and MAPE of multiple linear regression are 0.259 and 0.169, those of random forest are 0.362 and 0.309, and those of the LSTM networks are 1.126 and 5.241. Multiple linear regression therefore achieves the best out-of-sample forecast, with the lowest RMSE and MAPE, while the LSTM networks achieve the worst, with the highest values. The forecast effect of random forest lies between the two, with RMSE and MAPE values close to those of multiple linear regression.
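The two indicators can be written as short NumPy functions. The sample values below are illustrative; note that MAPE is expressed here as a ratio, since conventions differ on whether it is scaled by 100.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error."""
    return np.sqrt(np.mean((y_pred - y_true) ** 2))

def mape(y_true, y_pred):
    """Mean absolute percentage error, as a ratio (not scaled by 100)."""
    return np.mean(np.abs((y_pred - y_true) / y_true))

# Illustrative series: every prediction is off by exactly 1 unit.
y_true = np.array([100.0, 102.0, 101.0, 105.0])
y_pred = np.array([101.0, 101.0, 102.0, 104.0])
```

Both indicators are zero for a perfect forecast and grow with the error, so lower is better.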

Conclusion
Three supervised learning algorithms (multiple linear regression, random forest, and LSTM networks) are used in this study to forecast future stock prices from the closing price of a NASDAQ ETF and from statistical factors. The forecast results of the three algorithms show that multiple linear regression is the most suitable for predicting future stock prices. Random forest has the best prediction performance on the in-sample data, but its out-of-sample performance is worse than that of multiple linear regression, which means random forest is prone to overfitting. The LSTM networks achieve the worst forecast performance on both in-sample and out-of-sample data, with the highest values of RMSE and MAPE. Forecasting future stock prices from the closing price of the NASDAQ ETF works better than forecasting from statistical factors, but the difference is not significant.