Feature Selection: A Review and Comparative Study

Abstract. Feature selection (FS) is an important research topic in data mining and machine learning. FS addresses the high-dimensionality problem: it is the process of selecting the relevant features and removing the irrelevant, redundant and noisy ones, with the aim of obtaining the best-performing subset of the original features without any transformation. This paper provides a comprehensive review of the FS literature, offering insights and recommendations to help readers. Moreover, an empirical study of six well-known feature selection methods is presented so as to critically analyze their applicability.


Introduction
The vast amount of available data has bombarded machine learning researchers with a multitude of unparalleled problems, making the learning task more challenging and computationally difficult. When data mining and machine learning algorithms are used to analyze high-dimensional data, the performance of the learning algorithms can deteriorate due to over-fitting [1]. Machine learning models also become more difficult to understand as the number of features increases, resulting in reduced generalizability [2]. To reduce the high dimensionality of data, dimensionality reduction techniques may be used. Feature extraction and feature selection are the two types of dimensionality reduction [3]. The main objective of feature extraction (FE) is to reduce the initial feature space to a smaller one in which the features lose their original meaning due to the transformation (see Fig. 1). Commonly used feature extraction methods include multi-dimensional scaling [7], Isomap [8], Locally Linear Embedding [9], and Principal Component Analysis [10]. Feature selection (FS), as opposed to feature extraction, is the task of selecting relevant attributes and deleting irrelevant and redundant ones in order to obtain the best feature subset without transformation (see Fig. 1). As a result, learning models built with the chosen subset of features are more readable and interpretable [3,4,5,25]. The main reasons for using feature selection are: reducing storage capacity and execution time, preventing the curse of dimensionality, minimizing over-fitting (resulting in improved model generalization), and increasing attainable performance [6].
This review provides a critical analysis of six feature selection methods from different categories to expose their advantages and disadvantages. Moreover, we shed light on recent advances in feature selection in order to provide guidelines and recommendations to practitioners and researchers. Along with reviewing existing feature selection methods, we empirically evaluate the effectiveness, shortcomings and applicability of six well-known FS methods: ReliefF, MI, SVM-RFE, SFS, LASSO and Ridge. In Section 2, the six FS methods are discussed together with their strengths and limitations. Section 3 is devoted to the results obtained by applying the previously discussed FS methods. We close the paper with the conclusion. Feature selection is categorized into three major approaches: the Filter, Wrapper and Embedded approaches.
- Filters: perform feature selection without involving any modeling algorithm, relying on certain characteristics of the training data, primarily each feature's relationship with the class label. Filter methods are faster and more generalizable than wrappers.
- Wrappers: instead of evaluating features individually as filters do, wrappers evaluate feature subsets, taking the interactions between features into account. To determine and compare the relative quality of each generated subset, they use a learning algorithm, which generally makes the learning process complex and expensive in terms of execution time.
- Embedded: since feature selection is conducted and optimized during the classification process itself, this approach is a compromise between filters and wrappers.

Feature selection methods classification
Feature selection is an active research field in machine learning, as it is an important pre-processing step that has found success in many real-world applications. In general, feature selection algorithms are categorized into supervised, semi-supervised and unsupervised feature selection [2,3,4,5,6]. Supervised feature selection methods usually come in three flavors: the Filter, Wrapper and Embedded approaches. In this review, we focus mainly on feature selection in high-dimensional classification tasks.

Filter Methods
Filters evaluate feature relevance based on intrinsic characteristics of the data. As can be seen in Figure 2, this category is a pre-processing stage that is independent of the induction algorithm. There are two key stages in the filtering process. Prior to building the classification model, it first ranks features individually based on a particular criterion such as distance, Pearson correlation, or entropy [4]. Second, it selects the best-ranked features using a threshold value; the remaining features are deemed unnecessary and uninformative. After that, the induction classifier receives the selected subset of features as input. Filters are fast, which makes them well suited for high-dimensional datasets [3,4,5]. Since the relations between the independent variables are not considered, however, redundant features may be chosen. The following sub-sections provide an in-depth analysis of two well-known filter methods.
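The two-stage rank-and-threshold workflow described above can be sketched as follows; the synthetic data, the Pearson-correlation criterion, and the choice of k are our illustrative assumptions, not prescribed by any particular method reviewed here.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 8))
y = (X[:, 0] + X[:, 1] > 0).astype(int)  # features 0 and 1 carry the signal

# Stage 1: rank each feature by |Pearson correlation| with the class label.
scores = np.array([abs(np.corrcoef(X[:, j], y)[0, 1]) for j in range(X.shape[1])])
ranking = np.argsort(scores)[::-1]

# Stage 2: keep the top-k ranked features; the rest are deemed uninformative.
k = 2
selected = np.sort(ranking[:k]).tolist()
print(selected)  # the two signal-carrying features
```

Note that, as discussed above, this per-feature ranking cannot detect redundancy: two perfectly correlated informative features would both be kept.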

Mutual Information
This technique measures how much information is shared between a feature and the class label (or between two features). MI is a measure of the impact of one random variable on another [22,23,26]. The feature sharing the most information with the class target is the best, since it can most effectively separate members of one class from those of another. MI is mathematically defined as follows:

MI(X; Y) = Σ_x Σ_y p(x, y) log( p(x, y) / (p(x) p(y)) )   (1)

MI is zero when X and Y are statistically independent, since in that case p(x, y) = p(x) p(y).
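As a hedged illustration, scikit-learn's `mutual_info_classif` implements this criterion; the synthetic dataset and the choice of k below are our own.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, mutual_info_classif

X, y = make_classification(n_samples=200, n_features=10,
                           n_informative=3, random_state=0)
# Estimate MI between each feature and the class label (scores are >= 0).
scores = mutual_info_classif(X, y, random_state=0)
# Threshold stage: keep the k features sharing the most information with y.
X_new = SelectKBest(mutual_info_classif, k=3).fit_transform(X, y)
print(X_new.shape)  # (200, 3)
```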

Relief
Relief is an instance-based method. For each training example, it finds the nearest neighbor from the same class (nearest hit) and the nearest neighbor from the opposite class (nearest miss) [12]. Relief quantifies each feature with a score W_i, which can be used to rank features. The Relief approach can be highly effective for high-dimensional datasets, since the weight W_i of each feature is iteratively updated for each sampled instance: if a feature's value differs more between nearby instances of the same class than between nearby instances of the other class, the feature weight decreases; otherwise, it increases. Each feature's weight is updated as follows:

W_i = W_i − (x_i − nearHit_i)^2 + (x_i − nearMiss_i)^2   (2)

This algorithm's limitations include the fact that it is only applicable to two-class classification problems and does not tackle redundancy. An extended version of Relief called Relief-A is proposed in [13] to handle the missing-data issue. Relief-F extends the method to multi-class problems [24].
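A minimal NumPy sketch of the basic two-class Relief update above (the function and variable names are our own, not from a library):

```python
import numpy as np

def relief(X, y):
    """Score each feature with the basic two-class Relief update."""
    n, d = X.shape
    w = np.zeros(d)
    for i in range(n):
        dist = ((X - X[i]) ** 2).sum(axis=1)
        dist[i] = np.inf                 # never pick the instance itself
        same = y == y[i]
        hit = X[np.where(same, dist, np.inf).argmin()]    # nearest hit
        miss = X[np.where(~same, dist, np.inf).argmin()]  # nearest miss
        # Penalize within-class differences, reward between-class ones.
        w += (X[i] - miss) ** 2 - (X[i] - hit) ** 2
    return w / n

X = np.array([[0.0, 0.1], [0.1, 0.9], [1.0, 0.2], [0.9, 0.8]])
y = np.array([0, 0, 1, 1])      # feature 0 separates the classes
print(relief(X, y))             # feature 0 gets the higher weight
```

The sketch scans every instance; the original algorithm samples instances randomly, which matters for large n.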

Wrapper Methods
In contrast, the wrapper approach constructs a model from scratch for each generated subset [14], then uses its prediction performance as the criterion function to verify the subset's quality. This category takes the interactions between features into account. Generally, wrappers achieve better performance than filters; however, they are very expensive in terms of model complexity and resource requirements. Many wrapper methods have been proposed in the literature [1,2,3,4,5]. In the following subsections, we concentrate on two of the most widely used wrapper methods.

Support Vector Machine Recursive Feature Elimination (SVM-RFE)
In [15], Guyon et al. introduced a hybrid feature selection method in which a support vector machine drives recursive feature elimination (SVM-RFE). It begins with all features and iteratively eliminates features without backtracking. This greedy top-down strategy computes the normal vector of the decision hyper-plane created by a linear support vector machine (SVM) classifier. The projections of the normal vector onto the coordinate axes are then used to rank the individual features. The features with the highest ranking are retained, and those with the lowest ranking are removed. This procedure is repeated until the desired number of features is reached or the performance cannot be improved any further.
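A sketch of this procedure using scikit-learn's `RFE` wrapper around a linear SVM; the synthetic dataset and the target of 5 features are illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.svm import LinearSVC

X, y = make_classification(n_samples=200, n_features=20,
                           n_informative=5, random_state=0)
# Rank features by |w| from the linear SVM and drop one per iteration.
rfe = RFE(estimator=LinearSVC(dual=False, max_iter=5000),
          n_features_to_select=5, step=1).fit(X, y)
print(rfe.support_.sum())  # 5 features retained
```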

Sequential Forward Selection (SFS)
SFS is a greedy search strategy [16]. It begins with an empty set and adds features in a greedy manner until the performance can no longer be improved. SFS works well when the optimal subset of representative features is small [17,18]. The disadvantage of this technique is that once a feature is added to the subset, it cannot be withdrawn, which can lead to suboptimal results.
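The greedy forward search can be sketched with scikit-learn's `SequentialFeatureSelector`; the k-NN base classifier and the subset size are our illustrative choices.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=200, n_features=10,
                           n_informative=3, random_state=0)
# Start from the empty set and greedily add the feature that most
# improves cross-validated accuracy, until 3 features are selected.
sfs = SequentialFeatureSelector(KNeighborsClassifier(),
                                n_features_to_select=3,
                                direction="forward").fit(X, y)
print(sfs.get_support().sum())  # 3
```

Setting `direction="backward"` gives the complementary sequential backward elimination, which shares the same no-backtracking limitation.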

Embedded Methods
Unlike the wrapper strategy, which uses a heuristic search driven by the classifier's results, the embedded strategy is a compromise between the filter and wrapper approaches: it performs feature selection and builds an optimized classifier within the learning process itself. Because the selected features emerge during training based on the classifier's own evaluation criteria, embedded methods incur lower computational costs than wrappers. Regularization approaches are the most commonly used embedded methods.

LASSO (L1-Regularization)
The Least Absolute Shrinkage and Selection Operator (LASSO), introduced by Robert Tibshirani in 1996 [19], is an efficient regularization-based feature selection tool. The LASSO approach penalizes the sum of the absolute values of the model coefficients. This additional constraint (regularization) drives some coefficients to zero. During feature selection, features with a non-zero coefficient after the shrinkage step are kept in the model, whereas those with a coefficient of exactly zero are omitted. A tuning parameter λ (the regularization parameter) adjusts the strength of the regularization [20]. When λ is sufficiently large, coefficients are forced to be exactly zero, which reduces dimensionality. Indeed, the larger λ is, the more coefficients are shrunk to zero; conversely, λ = 0 means that no regularization is applied (ordinary least squares). There are several benefits to using LASSO. Since the coefficients of deceptive features are penalized and eliminated, it is effective at reducing variance; as a result, it helps prevent over-fitting, yielding better generalization ability. Furthermore, LASSO improves interpretability by canceling irrelevant features. The LASSO approach can be written in several mathematical forms; according to Tibshirani, the lasso estimate is the solution to the L1-constrained problem:
β̂ = argmin_β Σ_i (y_i − β_0 − Σ_j x_ij β_j)^2   subject to   Σ_j |β_j| ≤ t

where t is the upper bound for the sum of the absolute coefficients. This optimization problem is equivalent to the following parameter estimation:

β̂ = argmin_β { Σ_i (y_i − β_0 − Σ_j x_ij β_j)^2 + λ Σ_j |β_j| }

where λ ≥ 0 is the regularization parameter that controls the amount of weight shrinkage. The lasso method has some limitations:
- LASSO tends to select one feature from each group of correlated features while ignoring the others.
- On datasets with small n and large p (p is the number of features), LASSO selects no more than n attributes before saturating.
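A hedged sketch of LASSO-based selection with scikit-learn, where `alpha` plays the role of λ; the synthetic regression problem, in which only the first two features matter, is our own assumption.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 20))
y = 3.0 * X[:, 0] + 2.0 * X[:, 1] + rng.normal(scale=0.1, size=100)

# A sufficiently large penalty drives irrelevant coefficients to exactly zero.
lasso = Lasso(alpha=0.1).fit(X, y)
selected = np.flatnonzero(lasso.coef_)
print(selected)  # typically only the two informative features survive
```

Sweeping `alpha` upward shrinks more coefficients to zero, mirroring the role of λ described above.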

RIDGE (L2-Regularization)
Ridge regularization, also known as L2-regularization, adds the squared magnitude of the model parameters to the objective function as a penalty term. Ridge shrinks model coefficients close to zero but not exactly to zero, unlike LASSO, which yields a sparse set of features. The biggest difference between the LASSO and Ridge strategies is that LASSO reduces the coefficients of unimportant features to zero, effectively canceling them out, whereas Ridge produces non-zero coefficients for all features. Table 1 provides a more detailed summary of the properties of the three feature selection categories, as well as their most prominent advantages and disadvantages.
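The contrast between the two penalties can be checked empirically; this side-by-side sketch (our own synthetic setup, with only feature 0 informative) counts the exact zeros in each model's coefficients:

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))
y = 2.0 * X[:, 0] + rng.normal(scale=0.1, size=100)

lasso = Lasso(alpha=0.1).fit(X, y)
ridge = Ridge(alpha=10.0).fit(X, y)
# LASSO zeros out irrelevant coefficients; Ridge only shrinks them.
print(np.sum(lasso.coef_ == 0), np.sum(ridge.coef_ == 0))
```

This is why Ridge alone does not perform feature selection: a separate threshold on the shrunken coefficients would be needed to discard features.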

Experimental results and discussion
In this section, an extensive set of experiments has been conducted to evaluate the previously discussed feature selection methods from the different categories: MI and ReliefF for filters, SVM-RFE and SFS for wrappers, and Ridge and LASSO for embedded methods. The experiments were carried out on seven benchmark datasets.

Datasets
To assess our approach, we used seven binary classification datasets available in the UCI machine learning repository [21]: Sonar, Spambase, Eighthr, Clean, Chess, Madelon, and Ds1.100. More details about the characteristics of each dataset are presented in Table 2. To assess the strengths and weaknesses of each feature selection method, we carefully selected datasets that differ in the number of data points, the number of attributes, linearity, and the class-label distribution.

Evaluation
In this experimental section, we tested the performance of six feature selection methods from the different FS categories in conjunction with two classification algorithms (Random Forest and Support Vector Machines) to obtain more reliable results. We use classification accuracy to evaluate the performance of each FS method. Generally, the higher the classification accuracy, the better the selected feature subset.

Discussion
By carefully analyzing the obtained accuracies (see Table 4), it is clear that wrapper methods increase the classification accuracy on the majority of datasets (5 out of 7), followed by embedded methods (4 out of 7), while filters increase the classification accuracy on 3 datasets out of 7. Besides classification accuracy, execution time is another critical aspect of FS, especially for high-dimensional datasets. Table 3 presents the running time (in seconds) of the implemented FS methods. Filter methods are clearly faster than the other approaches (wrapper and embedded). Since wrappers use learning algorithms to assess each feature subset, they are very expensive compared to the filter category; this is demonstrated even more clearly as the dataset size grows (see datasets ds1.100 and madelon). Embedded methods are good alternatives, especially when the feature space is high-dimensional.

Conclusion
In this review paper, we have provided a comprehensive survey of state-of-the-art feature selection methods and their categorization. We have cited and discussed several fundamental algorithms of each category, and conducted a critical analysis and comparison to expose the merits and demerits of each FS method. In addition, we have provided an empirical study to evaluate six feature selection methods: ReliefF and MI as filters, SFS and SVM-RFE as wrappers, and LASSO and Ridge as embedded methods. Each method was evaluated in combination with two well-known classifiers (SVM and RF) to reliably test its ability to select the best subset. Seven benchmark datasets were employed to demonstrate the applicability of the feature selection methods. The results clearly confirm that more information does not always help machine learning algorithms: a feature set may include redundant or irrelevant features, and their presence can deteriorate the performance of the classifier. Feature selection techniques are therefore indispensable for reducing data dimensionality.
To close this paper, we present some overlooked and open issues that deserve further discussion and analysis:
- Ensemble techniques should be exploited in the feature selection field to enhance selection stability.
- Since the wrapper approach is mostly avoided in the literature, the focus should be placed on hybrid and ensemble approaches.
- Due to the immense volume of datasets generated in many domains, including image recognition, text classification and biomedicine, we intend to apply deep learning models to estimate the goodness of features, relying on hidden feature interactions that may be better detected using deep learning than with classical machine learning models.
- Using feature selection methods in environmental sound recognition could help extract a large amount of information for understanding our environment.