Introduction
Feature selection is a critical and widely used technique in data
processing, aimed at selecting the most relevant features from noisy
data. This approach not only enhances the execution speed of data
processing algorithms but also improves prediction accuracy and reduces
variability in results.
Feature Selection Methods
Feature selection, a dimensionality reduction technique, focuses on
identifying a small subset of relevant features from the original
dataset by eliminating irrelevant, redundant, or noisy features. This
process typically enhances learning performance, improves model
accuracy, reduces computational costs, and increases model
interpretability.
Several statistical methods relevant to feature selection have been
discussed in various statistics courses. This section provides an
overview of feature selection types, methodologies, and techniques
commonly employed in both statistics and machine learning. Based on
their nature, these methods are categorized into four distinct types,
which will be outlined in the following subsections.
Filter Methods
Filter methods are statistics-based feature selection methods that
evaluate the relationship between each input variable and the target
(response) variable and select the input variables that have the
strongest relationship with the target. These methods can be fast and
effective, although the choice of statistical measure depends on the
data types of both the input and output variables. Here are some of
these methods with brief descriptions.
Chi-square Test
Consider a scenario where we need to determine the relationship between
a categorical feature (predictor) and a categorical response. In feature
selection, we aim to select the features that are most strongly
dependent on the response. We calculate the chi-square statistic between
each feature and the response variable and select the desired number of
features with the highest chi-square scores.
To apply the chi-square test correctly to the relationship between the
features in the data set and the target variable, the following
conditions have to be met: the variables have to be categorical, the
observations have to be sampled independently, and each cell of the
contingency table should have an expected frequency of at least 5.
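The scoring step described above can be sketched in plain Python. This is a minimal illustration, not a reference implementation: the function names `chi_square_score` and `select_k_best` and the toy data are made up for this example, and the toy sample is far too small to satisfy the expected-frequency condition, so it only illustrates the arithmetic.

```python
from collections import Counter

def chi_square_score(feature, target):
    """Pearson chi-square statistic for two categorical sequences."""
    n = len(feature)
    observed = Counter(zip(feature, target))  # joint counts per (feature, target) cell
    row_totals = Counter(feature)
    col_totals = Counter(target)
    stat = 0.0
    for a in row_totals:
        for b in col_totals:
            # Expected count under independence of feature and target.
            expected = row_totals[a] * col_totals[b] / n
            stat += (observed[(a, b)] - expected) ** 2 / expected
    return stat

def select_k_best(features, target, k):
    """Return the names of the k features with the largest chi-square score."""
    scores = {name: chi_square_score(col, target) for name, col in features.items()}
    return sorted(scores, key=scores.get, reverse=True)[:k]

# Toy data, invented for illustration only.
target = ["yes", "yes", "no", "no", "yes", "no", "yes", "no"]
features = {
    "colour": ["r", "r", "b", "b", "r", "b", "r", "b"],  # perfectly associated with target
    "shape":  ["s", "c", "s", "c", "s", "c", "c", "s"],  # unrelated to target
}
print(select_k_best(features, target, k=1))  # ['colour']
```

In practice one would usually also look at the p-value of the statistic rather than the raw score alone, since the score depends on the number of categories.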
Fisher’s Score
The Fisher score is one of the most widely used supervised feature
selection methods. It seeks features with the best discriminant ability:
features for which the distances between data points of different
classes are large and the distances among points of the same class are
small. To rank the features by relevance, they are sorted in decreasing
order of their Fisher score, so the higher the score assigned to a
feature, the more important it is.
Let \(Y\) be a categorical variable with \(C\) categories and \(X\) be a
numerical variable. The Fisher score of \(X\) is defined by
\[
F_X = \frac{\sum_{i=1}^C N_i(\mu_{X}^i-\mu_{X})^2}{\sum_{i=1}^C N_i
(\sigma_X^i)^2}
\]
where \(N_i\) is the number of data points in class \(i\), \(\mu_X\) is
the overall mean of feature \(X\), and \(\mu_X^i\) and
\((\sigma_X^i)^2\) are the mean and the variance of \(X\) within class
\(i\), respectively.
The algorithm returns the features ranked by Fisher score in descending
order. We can then select the top-ranked features.
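As a rough sketch, the formula above translates directly into a few lines of Python. The function name `fisher_score` and the toy data are invented for this example, and the within-class variance is taken to be the population variance, which is an assumption since the text does not say which estimator is meant.

```python
from statistics import fmean, pvariance

def fisher_score(x, y):
    """Fisher score of a numeric feature x with respect to class labels y."""
    overall_mean = fmean(x)
    between = 0.0  # numerator: weighted squared distances of class means
    within = 0.0   # denominator: weighted within-class variances
    for label in set(y):
        values = [xi for xi, yi in zip(x, y) if yi == label]
        n_i = len(values)
        between += n_i * (fmean(values) - overall_mean) ** 2
        within += n_i * pvariance(values)  # assumption: population variance
    # Note: within is zero if the feature is constant inside every class.
    return between / within

# Toy data, invented for illustration only.
y = [0, 0, 0, 1, 1, 1]
features = {
    "separating": [1, 2, 3, 7, 8, 9],  # class means far apart
    "mixed":      [1, 5, 9, 2, 5, 8],  # class means coincide
}
scores = {name: fisher_score(col, y) for name, col in features.items()}
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking, scores)
```

Here the feature whose class means are well separated gets a large score, while the feature whose class means coincide gets a score of zero.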
Correlation Coefficient
Correlation is a measure of the linear relationship between two
variables. When two variables are strongly correlated, one can be
predicted from the other. The logic behind using correlation for feature
selection is that good features are highly correlated with the response.
Furthermore, features should be correlated with the response but
uncorrelated among themselves, since two highly correlated features
carry largely redundant information. This method is valid when both the
response and the feature variables are numeric.
Variance Threshold
The variance threshold is a simple baseline approach to feature
selection. It removes all features whose variance does not meet some
threshold. The logic for this method is that features with a higher
variance may contain more useful information, while a feature that is
nearly constant carries little information about the response.
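A minimal sketch of this filter is shown below. The function name `variance_threshold`, the default threshold of zero, and the toy columns are assumptions for illustration. Because variance depends on the scale of each feature, a single threshold is only meaningful when features are on comparable scales.

```python
from statistics import pvariance

def variance_threshold(features, threshold=0.0):
    """Return the names of features whose variance exceeds the threshold."""
    return [name for name, col in features.items() if pvariance(col) > threshold]

# Toy data, invented for illustration only.
features = {
    "constant": [1, 1, 1, 1],  # variance 0, removed
    "binary":   [0, 1, 0, 1],  # variance 0.25, kept
    "ramp":     [1, 2, 3, 4],  # variance 1.25, kept
}
print(variance_threshold(features))  # ['binary', 'ramp']
```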
Wrapper Methods
Wrapper methods search the space of possible feature subsets, assessing
the quality of each candidate subset by training and evaluating a model
(for example, a classifier) on it. The feature selection process is
therefore tied to the specific machine learning algorithm that we are
trying to fit on a given data set. Because evaluating all possible
combinations of features is usually too expensive, most wrapper methods
follow a greedy search strategy, evaluating a limited sequence of
candidate subsets against the evaluation criterion. Wrapper methods
usually result in better predictive accuracy than filter methods, at a
higher computational cost.
The following are a few such methods.
Forward Feature Selection
This is an iterative method in which we start with the single variable
that performs best against the target. Next, we select another variable
that gives the best performance in combination with the first selected
variable. This process continues until a preset criterion is met.
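The loop described above can be sketched as follows. This is an illustrative outline only: `score` stands for any function that trains and evaluates a model on a feature subset, and here it is replaced by a hypothetical `toy_score` with made-up gains and a small per-feature penalty. The stopping criterion (no improvement, or `max_features` reached) is likewise an assumption.

```python
def forward_selection(candidates, score, max_features):
    """Greedy forward selection: add the feature that improves score the most."""
    selected = []
    best_score = float("-inf")
    remaining = list(candidates)
    while remaining and len(selected) < max_features:
        # Score every one-feature extension of the current subset.
        trial_scores = {f: score(selected + [f]) for f in remaining}
        best_feature = max(trial_scores, key=trial_scores.get)
        if trial_scores[best_feature] <= best_score:
            break  # no candidate improves the score, so stop
        selected.append(best_feature)
        remaining.remove(best_feature)
        best_score = trial_scores[best_feature]
    return selected, best_score

# Hypothetical stand-in for "fit a model and return its validation score".
GAIN = {"age": 0.30, "income": 0.25, "zip": 0.02, "noise": 0.0}

def toy_score(subset):
    return sum(GAIN[f] for f in subset) - 0.05 * len(subset)

selected, final_score = forward_selection(list(GAIN), toy_score, max_features=4)
print(selected, final_score)
```

With this toy score the search adds "age", then "income", and stops because neither remaining feature improves the score.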
Backward Feature Elimination
This method works in the opposite direction to Forward Feature
Selection. Here, we start with all the available features and build a
model. Next, we remove the variable whose removal gives the best
evaluation measure value. This process continues until a preset
criterion is met.
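A sketch of the elimination loop is given below, under the same kind of assumptions as any wrapper example: `score` is a placeholder for training and evaluating a model on a subset, replaced here by a hypothetical `toy_score`, and the stopping rule (stop when every removal lowers the score, or when `min_features` remain) is one possible choice of preset criterion.

```python
def backward_elimination(features, score, min_features=1):
    """Greedy backward elimination: drop the feature whose removal scores best."""
    selected = list(features)
    best_score = score(selected)
    while len(selected) > min_features:
        # Score the subset obtained by removing each feature in turn.
        trial_scores = {
            f: score([g for g in selected if g != f]) for f in selected
        }
        to_drop = max(trial_scores, key=trial_scores.get)
        if trial_scores[to_drop] < best_score:
            break  # every removal makes the model worse, so stop
        selected.remove(to_drop)
        best_score = trial_scores[to_drop]
    return selected, best_score

# Hypothetical stand-in for "fit a model and return its validation score".
GAIN = {"age": 0.30, "income": 0.25, "zip": 0.02, "noise": 0.0}

def toy_score(subset):
    return sum(GAIN[f] for f in subset) - 0.05 * len(subset)

selected, final_score = backward_elimination(list(GAIN), toy_score)
print(selected, final_score)
```

With this toy score the procedure drops "noise", then "zip", and then stops.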
Subset Feature Selection
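This is the most robust feature selection method covered so far, but
also the most expensive. It is a brute-force evaluation of every feature
subset: it tries every possible combination of the variables and returns
the best-performing subset. With \(p\) features there are \(2^p - 1\)
non-empty subsets, so this approach is only practical when \(p\) is
small.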
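The brute-force search can be sketched with `itertools.combinations`. As before, `score` is a placeholder for model training and evaluation; the hypothetical `toy_score` below is constructed so that two features are only useful together, a case that a one-at-a-time greedy search can miss.

```python
from itertools import combinations

def best_subset(features, score):
    """Evaluate every non-empty subset and return the best one with its score."""
    best, best_score = (), float("-inf")
    evaluated = 0
    for k in range(1, len(features) + 1):
        for subset in combinations(features, k):
            evaluated += 1
            s = score(subset)
            if s > best_score:
                best, best_score = subset, s
    return list(best), best_score, evaluated

# Hypothetical score: "a" and "b" help only in combination, "c" helps alone.
def toy_score(subset):
    chosen = set(subset)
    base = 0.5 if "c" in chosen else 0.0
    if {"a", "b"} <= chosen:
        base = max(base, 0.9)
    return base - 0.01 * len(chosen)  # small penalty per feature

subset, value, evaluated = best_subset(["a", "b", "c"], toy_score)
print(subset, value, evaluated)
```

For three features the search evaluates all seven non-empty subsets and returns the pair "a", "b". A greedy forward search on the same score would start with "c" and never find that pair.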
Embedded Methods
These methods combine the benefits of both the wrapper and filter
methods: they account for interactions between features while
maintaining a reasonable computational cost. Embedded methods are
iterative in the sense that selection happens during model training: at
each iteration of the training process, they extract the features that
contribute the most to the training for that iteration.
LASSO Regularization
Regularization consists of adding a penalty on the parameters of a
machine learning model to reduce the freedom of the model, i.e., to
avoid overfitting. In linear model regularization, the penalty is
applied to the coefficients that multiply each of the predictors. Among
the different types of regularization, Lasso (L1) has the property that
it can shrink some of the coefficients to exactly zero. The
corresponding features can therefore be removed from the model.
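To make the "shrink to zero" behaviour concrete, here is a small coordinate-descent sketch for the Lasso. It is one common way to fit the model, not necessarily the one intended by the text. The assumptions are: the objective is scaled as (1/(2n)) * sum of squared residuals + alpha * sum of |beta_j|, there is no intercept (the toy columns are already centered), no column is identically zero, and the names `soft_threshold`, `lasso_coordinate_descent`, and `alpha` are chosen for this example.

```python
def soft_threshold(rho, alpha):
    """Shrink rho towards zero by alpha; values within [-alpha, alpha] become 0."""
    if rho > alpha:
        return rho - alpha
    if rho < -alpha:
        return rho + alpha
    return 0.0

def lasso_coordinate_descent(X, y, alpha, n_iter=200):
    """Minimise (1/(2n))*||y - X beta||^2 + alpha*||beta||_1 (no intercept)."""
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    for _ in range(n_iter):
        for j in range(p):
            rho = 0.0  # correlation of feature j with the partial residual
            z = 0.0    # squared norm of feature j
            for i in range(n):
                pred_without_j = sum(X[i][k] * beta[k] for k in range(p) if k != j)
                rho += X[i][j] * (y[i] - pred_without_j)
                z += X[i][j] ** 2
            beta[j] = soft_threshold(rho / n, alpha) / (z / n)
    return beta

# Toy data, invented for illustration: y depends on the first column only.
X = [[-2.5, 1.5], [-1.5, -2.5], [-0.5, 0.5], [0.5, -1.5], [1.5, 2.5], [2.5, -0.5]]
y = [3 * row[0] for row in X]

beta = lasso_coordinate_descent(X, y, alpha=0.5)
selected = [j for j, b in enumerate(beta) if b != 0.0]
print(beta, selected)
```

On this toy data the coefficient of the second column is set to exactly zero, so only the first feature is selected, while the first coefficient is shrunk slightly below its true value of 3.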
Random Forest Importance
A random forest is a bagging algorithm that aggregates a specified
number of decision trees. The tree-based strategies used by random
forests naturally rank features by how well they improve the purity of a
node, or in other words by the decrease in impurity (Gini impurity)
averaged over all trees. Nodes with the greatest decrease in impurity
tend to occur at the start of the trees, while nodes with the least
decrease in impurity occur at the end of the trees. Thus, by pruning
trees below a particular node, we can create a subset of the most
important features.
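The quantity being ranked can be illustrated for a single split. The sketch below is not a random forest: it only computes the Gini impurity decrease that one threshold split on one feature would produce, which is the building block a forest accumulates over many splits and trees. The function names and toy data are assumptions for illustration.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels (0 for an empty list)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def impurity_decrease(feature, labels, threshold):
    """Decrease in Gini impurity from splitting on feature <= threshold."""
    left = [l for x, l in zip(feature, labels) if x <= threshold]
    right = [l for x, l in zip(feature, labels) if x > threshold]
    n = len(labels)
    # Impurity of the children, weighted by the share of samples in each.
    weighted = len(left) / n * gini(left) + len(right) / n * gini(right)
    return gini(labels) - weighted

# Toy data, invented for illustration only.
labels = [0, 0, 0, 1, 1, 1]
informative = [1, 2, 3, 7, 8, 9]    # a split at 5 separates the classes
uninformative = [1, 7, 2, 8, 3, 9]  # a split at 5 mixes the classes

print(impurity_decrease(informative, labels, threshold=5))
print(impurity_decrease(uninformative, labels, threshold=5))
```

The informative feature removes all of the impurity at the node, while the uninformative one removes very little, so it would receive a much lower importance.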
Hybrid Methods
Hybrid methods try to exploit the qualities of both the filter and
wrapper approaches, aiming for a good compromise between efficiency
(computational effort) and effectiveness (quality in the associated
objective task when using the selected features).
To take advantage of both approaches, hybrid methods work in two
stages. In a filter stage, the features are ranked or selected by
applying a measure based on intrinsic properties of the data. In a
wrapper stage, certain feature subsets are then evaluated to find the
best one, using a specific clustering algorithm. We can distinguish two
types of hybrid methods: methods based on ranking of features and
methods not based on ranking.
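A ranking-based hybrid can be sketched in a few lines: a filter score orders the features, and the wrapper stage then evaluates only the nested "top k" subsets of that ordering. Everything named here is hypothetical: `filter_scores` stands for any filter measure, and `wrapper_score` stands for whatever model-based evaluation the wrapper stage uses (a toy function replaces it below).

```python
def hybrid_select(filter_scores, wrapper_score):
    """Filter stage ranks features; wrapper stage picks the best top-k prefix."""
    ranked = sorted(filter_scores, key=filter_scores.get, reverse=True)
    best_subset, best_score = [], float("-inf")
    for k in range(1, len(ranked) + 1):
        subset = ranked[:k]  # the k highest-ranked features
        s = wrapper_score(subset)
        if s > best_score:
            best_subset, best_score = subset, s
    return best_subset, best_score

# Made-up filter scores and a hypothetical wrapper evaluation.
filter_scores = {"age": 0.8, "income": 0.6, "zip": 0.1, "noise": 0.05}
GAIN = {"age": 0.30, "income": 0.25, "zip": 0.02, "noise": 0.0}

def toy_wrapper_score(subset):
    return sum(GAIN[f] for f in subset) - 0.05 * len(subset)

subset, value = hybrid_select(filter_scores, toy_wrapper_score)
print(subset, value)
```

The wrapper stage here evaluates only as many subsets as there are features, rather than every possible combination, which is where the efficiency gain over a pure wrapper search comes from.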
