A Comparative Study using Feature Selection to Predict the Behaviour of Bank Customers

Although banks hold abundant data on their customers, they commonly track customer activity to improve their services and to understand why many customers leave for other banks. Analyzing customer behavior lets banks reach out to customers on a personal level and develop a business model that improves pricing, communication, advertising, and benefits for both the customers and the bank. Features such as the amount a customer credits every month, annual salary, and gender are used to classify customers with machine learning algorithms such as the K Neighbors Classifier and the Random Forest Classifier. Classifying the customers gives banks an idea of who will stay with them and who will leave in the near future. Our study aims to remove the independent variables that do not influence the prediction of a customer's future status, without loss of accuracy, and to examine whether doing so also increases the accuracy of the results.


INTRODUCTION
Machine Learning is highly regarded as a methodology to build business models. Finance and banking require a solid plan to run a business and keep customers satisfied. A slight oversight could change the dynamics of the business.
The data generated regularly by these businesses is significant for their future growth. To obtain a convincing result, a dataset of bank customers was collected for model training and testing. Although the dataset is close to real-world values, it contains several features that might not contribute to the outcome of our study.
The outcome can determine the future relationship between the bank and its customers; the methods should therefore produce results with high accuracy. To this end, we apply feature selection, a methodology that helps determine which features in the dataset matter most for an accurate result. Feature selection is regarded as one of the most important steps in removing unnecessary data and reducing the complexity of the dataset.
This study aims to improve the accuracy of predicting which customers will stay with the bank and which will leave, using the independent variables provided in the dataset and removing the non-influential variables that do not affect the accuracy. By removing these variables, the dataset can be optimised without any loss of accuracy.

RELATED WORK
S. Khalid et al. [1] explored different dimensionality reduction techniques to understand the strengths and weaknesses of the procedures currently in use.
Chih-Ta Lin et al. [2] used feature selection to convert high-dimensional feature vectors into low-dimensional feature vectors using a real-world dataset.
Lawrence B. Holder et al. [3] explored improvements in feature extraction methods by examining current and future trends in feature selection for classification problems.
C. A. Murthy et al. [4] worked on developing a unified framework of feature selection and extraction for both supervised and unsupervised cases.
Zena M. Hira et al. [5] used feature selection and extraction methods to analyse microarrays, which are very large in size. They also provided a comparison of various selection methods.
Kratarth Goel et al. [6] used CDFs to improve the accuracy of classification problems. They reported strong results by applying the methods to two entirely different problem statements.
Our study on feature selection is also related to various studies done by other researchers [7-13].
E3S Web of Conferences 184, 01011 (2020) https://doi.org/10.1051/e3sconf/202018401011 ICMED 2020
1. Data Collection: The dataset for determining customers' behaviour was collected from Udemy. It has 10,000 rows and 14 columns. The features of this dataset are RowNumber, CustomerId, Surname, CreditScore, Geography, Gender, Age, Tenure, Balance, NumOfProducts, HasCrCard, IsActiveMember, EstimatedSalary and Exited. The study determines whether these features influence the result or are insignificant.
For the testing dataset, the value of n was taken as 5, which was later used to calculate the value of k, i.e. k = sqrt(n)/2, with the Euclidean distance used for data plotting and the median as the aggregation method.
3. Encoding: As the dataset also contained categorical and text values, these had to be encoded as numerical inputs. This was achieved using LabelEncoder and OneHotEncoder from the scikit-learn Python library.
4. Model Selection: The methodology next required us to select the machine learning algorithms that would generate the highest accuracy in predicting which bank customers will stay and which will exit. Model selection is an important phase in applying machine learning methods to any dataset, as the outcome depends on how well the selected algorithms fit the data. Data scientists and analysts use different machine learning algorithms on their datasets. These algorithms can be divided into supervised and unsupervised algorithms. Depending on the output label, supervised learning is further divided into classification and regression.
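The encoding step can be illustrated with a minimal sketch, assuming scikit-learn and NumPy are installed. The four sample rows are invented for the example and are not taken from the study's dataset; which encoder the authors applied to which column is also an assumption here.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

# Hypothetical sample values for two text columns of the dataset.
geography = np.array(["France", "Spain", "Germany", "France"])
gender = np.array(["Female", "Male", "Female", "Male"])

# LabelEncoder maps each category to an integer; classes are sorted alphabetically.
gender_encoded = LabelEncoder().fit_transform(gender)

# OneHotEncoder expects 2-D input and returns a sparse matrix by default,
# so the column is reshaped and the result converted to a dense array.
geo_encoder = OneHotEncoder()
geography_onehot = geo_encoder.fit_transform(geography.reshape(-1, 1)).toarray()

print(gender_encoded)
print(geo_encoder.categories_[0])
print(geography_onehot)
```

A two-valued column such as Gender fits into a single integer column, whereas a column with several unordered values such as Geography is usually one-hot encoded so that the integer codes do not suggest an ordering.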
The output label has been identified as a classification label. Therefore, the following algorithms are applied to the dataset:
• K Neighbors Classifier: This method was used as it considers all the data points and classifies them based on the Euclidean distance function.
• Random Forest Classifier: This method was considered as it uses decision trees for classifying the data. In practice, it proved efficient in working with test errors.
• Extra Trees Classifier: This method was considered as it avoids overfitting the data by randomizing a few decisions and the data in subsets.
• Random Forest Regressor: A random forest can perform both regression and classification tasks on the data by building multiple decision trees. It was considered as it offers efficient estimates of test error.
These methods were fitted to the data points and used for prediction. A confusion matrix was visualized to understand the errors generated in the process. The accuracy of the classifiers was tested to determine the best classifier.
5. Feature Selection: It was identified that the accuracy of the prediction can be improved by removing insignificant features, as they increase the size and complexity of the dataset. Feature selection was performed so that training on the dataset is faster. The Extra Trees Classifier was used to plot the importance of each feature in the dataset. Based on this plot, the significant features were kept and the insignificant features were removed. Finally, the Random Forest Classifier was used to build decision trees to see whether there was any loss or improvement in accuracy after feature selection.
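The model selection and feature selection steps can be sketched as follows. This is a minimal illustration on synthetic data, assuming scikit-learn is installed; the data, the hyperparameters, and the cut-off `top_k` are assumptions made for the example and are not the settings used in the study. The Random Forest Regressor is left out because the sketch only compares classifiers.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the bank dataset: 10 features, only 4 informative.
X, y = make_classification(n_samples=1000, n_features=10, n_informative=4,
                           n_redundant=0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# Fit each candidate classifier and record its accuracy on the test split.
models = {
    "KNeighbors": KNeighborsClassifier(n_neighbors=5),
    "RandomForest": RandomForestClassifier(n_estimators=100, random_state=0),
    "ExtraTrees": ExtraTreesClassifier(n_estimators=100, random_state=0),
}
scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = accuracy_score(y_test, model.predict(X_test))

# Rank features by the Extra Trees importances and keep the highest ranked.
importances = models["ExtraTrees"].feature_importances_
top_k = 4  # assumed cut-off; the paper picks features by inspecting a plot
selected = sorted(range(X.shape[1]),
                  key=lambda i: importances[i], reverse=True)[:top_k]

# Retrain the random forest on the reduced feature set for comparison.
rf_reduced = RandomForestClassifier(n_estimators=100, random_state=0)
rf_reduced.fit(X_train[:, selected], y_train)
scores["RandomForest (selected)"] = accuracy_score(
    y_test, rf_reduced.predict(X_test[:, selected]))

for name, score in scores.items():
    print(f"{name}: {score:.3f}")
```

Whether the reduced model matches or beats the full model depends on the data, so the comparison should be read from the printed scores rather than assumed.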

RESULTS
The confusion matrix visualized to get the test errors is as follows.

Fig. 2. Confusion matrix
The accuracy score for predicting which customers will stay and which will leave, determined using the K Neighbors Classifier on the testing dataset, was 82.85%.
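For reference, the relation between a binary confusion matrix and the accuracy score can be sketched in plain Python. The labels below are hypothetical (1 = customer exited, 0 = customer stayed) and do not reproduce the study's figures.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (tn, fp, fn, tp) for binary labels."""
    tn = fp = fn = tp = 0
    for actual, predicted in zip(y_true, y_pred):
        if actual == positive and predicted == positive:
            tp += 1
        elif actual == positive:
            fn += 1
        elif predicted == positive:
            fp += 1
        else:
            tn += 1
    return tn, fp, fn, tp


def accuracy(tn, fp, fn, tp):
    """Share of predictions that match the true label."""
    return (tn + tp) / (tn + fp + fn + tp)


# Hypothetical labels for ten customers.
y_true = [0, 0, 0, 0, 1, 1, 0, 1, 0, 0]
y_pred = [0, 0, 1, 0, 1, 0, 0, 1, 0, 0]

tn, fp, fn, tp = confusion_counts(y_true, y_pred)
print(tn, fp, fn, tp)            # 6 1 1 2
print(accuracy(tn, fp, fn, tp))  # 0.8
```

The off-diagonal counts (false positives and false negatives) are the test errors that the confusion matrix makes visible.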
After plotting the feature importances using the ExtraTrees Classifier, the features that were insignificant were removed. The features found to have significant importance using Chi-square are RowNumber, Geography, Gender, Age, Balance, NumOfProducts and IsActiveMember.
The other features were removed before plotting and calculating the accuracy score using the training dataset.
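The paper does not state how the Chi-square scores were computed. One common reading is a test of independence between a categorical feature and the Exited label, which the following plain-Python sketch illustrates. The toy columns are hypothetical, and the function is an illustration under that assumption, not the authors' procedure.

```python
from collections import Counter


def chi_square_statistic(feature, target):
    """Pearson chi-square statistic of the feature-by-target contingency table."""
    n = len(feature)
    observed = Counter(zip(feature, target))
    feature_totals = Counter(feature)
    target_totals = Counter(target)
    statistic = 0.0
    for f_value, f_count in feature_totals.items():
        for t_value, t_count in target_totals.items():
            # Expected cell count if feature and target were independent.
            expected = f_count * t_count / n
            statistic += (observed[(f_value, t_value)] - expected) ** 2 / expected
    return statistic


# Hypothetical toy columns for eight customers.
exited = [1, 1, 1, 0, 0, 0, 0, 0]
is_active_member = [0, 0, 0, 1, 1, 1, 1, 0]
has_cr_card = [1, 0, 1, 0, 1, 0, 1, 0]

stat_active = chi_square_statistic(is_active_member, exited)
stat_card = chi_square_statistic(has_cr_card, exited)
print(round(stat_active, 3))  # 4.8
print(round(stat_card, 3))    # 0.533
```

A larger statistic indicates a stronger association with the label, so in this toy example the activity flag would be ranked above the credit-card flag.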
On predicting the values using the Random Forest Classifier, the accuracy score improved as follows:
Accuracy of the prediction before feature selection:
Accuracy of the prediction after feature selection using features determined by ExtraTrees Classifier:
Accuracy of the prediction after feature selection using features determined by Chi-square:
To understand the false positive rate and the true positive rate, an ROC curve has been visualised below. The ROC curve plots the true positive rate against the false positive rate as the classification threshold varies. The x-axis shows the false positive rate, whereas the y-axis shows the true positive rate.
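How the points of an ROC curve arise from predicted probabilities can be sketched in plain Python. The labels and scores below are hypothetical, and the helper names are chosen for this illustration only.

```python
def roc_points(y_true, scores):
    """Return (fpr, tpr) pairs obtained by thresholding at each distinct score."""
    positives = sum(y_true)
    negatives = len(y_true) - positives
    points = [(0.0, 0.0)]
    for threshold in sorted(set(scores), reverse=True):
        # Everything scored at or above the threshold is predicted positive.
        tp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 1)
        fp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 0)
        points.append((fp / negatives, tp / positives))
    return points


def area_under_curve(points):
    """Area under the piecewise-linear curve by the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


# Hypothetical labels (1 = exited) and predicted probabilities of exiting.
y_true = [1, 0, 1, 0]
scores = [0.9, 0.8, 0.7, 0.1]

points = roc_points(y_true, scores)
print(points)                    # [(0.0, 0.0), (0.0, 0.5), (0.5, 0.5), (0.5, 1.0), (1.0, 1.0)]
print(area_under_curve(points))  # 0.75
```

Each point corresponds to one threshold; lowering the threshold moves along the curve from (0, 0) towards (1, 1).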

CONCLUSION
On comparing the accuracy before and after feature selection, it was found that the accuracy increased by 2.3%. The increase in the accuracy of predicting the 'Exited' feature of the customers' behaviour dataset can help in improving the business model of the bank.
Bankers can now develop more personalised strategies to retain their customers and also devise future plans to attract more customers to their banks. For business analysts or data scientists who work on such business models, it can be concluded that feature selection can not only improve the accuracy but also decrease the complexity of the dataset and speed up training. The methods applied above do not identify the best parameters for the dataset; future work can therefore involve selecting the optimal number of parameters to increase the prediction accuracy.