FEATURE SELECTION IN SUPERVISED LEARNING - PART 2

Chiedu Mokwunye
8 min read · Mar 3, 2024


In “PART 1,” we delved into the concept of feature selection and categorized it into three main methods: Filter, Wrapper, and Embedded. Our focus was primarily on the Filter method. Now, in “PART 2,” let’s discuss the Wrapper method and explore its intricacies.

Wrapper Method
The wrapper method is a feature selection technique that evaluates subsets of features by training a model on them and keeping the combinations that work best together. Within this method, there are several techniques, described below:

1. Forward Feature Selection (FFS): MLXTEND, the library you would most likely use to implement this method, defines forward feature selection as one of a family of greedy search algorithms used to reduce an initial d-dimensional feature space to a k-dimensional feature subspace, where k < d. The primary objective is to select the best subset of features for a given model. Here’s how the algorithm works:

Consider the following worked example (a runnable sketch follows below). We start by loading the Boston housing dataset, which older versions of scikit-learn shipped as load_boston. This dataset contains features (X) and a target variable (y): the features (X) represent the input data, while the target (y) is what we aim to predict.

Next, we split the dataset into training and testing sets. This step is crucial for evaluating the model’s performance. The training set is used to train the model, while the testing set is used to evaluate its performance.

We then choose a regression model, Linear Regression in this case, since the target variable (y) is continuous. Using a for loop, the algorithm iterates through each feature. For each iteration, it builds a model using only that feature and calculates the Mean Squared Error (MSE). The MSE is a metric that quantifies the difference between predicted values and actual values. A lower MSE indicates a better-performing model.
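
Here is a minimal sketch of that first pass, not the original ChatGPT-generated snippet. Because load_boston has been removed from recent scikit-learn releases, the sketch uses load_diabetes as a stand-in regression dataset, so feature names and MSE values will differ from the numbers quoted in this article.

```python
# First FFS pass: score each feature on its own with Linear Regression.
from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Stand-in dataset (load_boston is no longer available in recent scikit-learn).
X, y = load_diabetes(return_X_y=True, as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

scores = {}
for feature in X_train.columns:
    model = LinearRegression()
    model.fit(X_train[[feature]], y_train)            # model built with one feature only
    preds = model.predict(X_test[[feature]])
    scores[feature] = mean_squared_error(y_test, preds)

best_feature = min(scores, key=scores.get)            # lowest MSE wins the first pass
print(best_feature, scores[best_feature])
```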

In simpler terms, the FFS algorithm:
- Starts out with an empty set of features
- Iterates over the features using a for loop
- For each feature (F1, F2, …, Fn), it builds a model using only that feature and calculates the MSE: first F1 on its own, then F2, and so on until every feature has been evaluated individually. After this initial pass, it selects the feature with the lowest MSE and adds it to a list. If F10 gave the lowest score, F10 would be the first feature in that list.

Now, as it iterates through the features again, the algorithm no longer starts with an empty list; it carries the previously selected feature into each step, so every new candidate is evaluated in combination with the features already selected. In this way the list grows one feature at a time, with the algorithm choosing, at each step, the candidate that improves the model the most.

For instance, if in the process F2 combined with F10 gives the lowest MSE, it would add both F2 and F10 to the list of selected features. This iterative process continues until it finds the combination of features that collectively result in the lowest MSE. So if, after evaluating all 12 features, a combination of 6 features yields the lowest MSE, those 6 features would be chosen to train the model.
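
Continuing the sketch above, the second pass might look like this, pairing each remaining feature with the one already selected (again an illustrative sketch, reusing best_feature and scores from the previous snippet):

```python
# Second FFS pass: pair each remaining feature with the one already selected.
selected = [best_feature]
pair_scores = {}
for feature in X_train.columns:
    if feature in selected:
        continue
    cols = selected + [feature]
    model = LinearRegression().fit(X_train[cols], y_train)
    pair_scores[feature] = mean_squared_error(y_test, model.predict(X_test[cols]))

next_feature = min(pair_scores, key=pair_scores.get)
# Only add the new feature if the pair beats the best single-feature score.
if pair_scores[next_feature] < scores[best_feature]:
    selected.append(next_feature)
print(selected)
```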

To make this concrete with an example run: F12 has the lowest MSE score initially, so it is added to the feature list. The algorithm then pairs each remaining feature with F12 and finds that the combination of F12 and F10 gives an even lower MSE. This process continues until the MSE of the previous combination is lower than that of any newly combined set of features, or until a stopping condition is met.
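
In practice you rarely write these loops by hand: MLXTEND’s SequentialFeatureSelector automates forward selection. A minimal sketch, reusing the X_train and y_train split from the earlier snippet:

```python
# Forward selection with MLXTEND's SequentialFeatureSelector.
from mlxtend.feature_selection import SequentialFeatureSelector as SFS
from sklearn.linear_model import LinearRegression

sfs = SFS(LinearRegression(),
          k_features='best',                 # let the selector pick the best-scoring subset
          forward=True,                      # forward selection: start empty, add features
          floating=False,
          scoring='neg_mean_squared_error',  # scikit-learn convention: higher (less negative) is better
          cv=5)
sfs = sfs.fit(X_train, y_train)
print(sfs.k_feature_names_, sfs.k_score_)
```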

2. Backward Feature Elimination (BFE): In this method, all features in the dataset are initially used to build the model, and the MSE score is calculated. Then, the algorithm goes through the dataset iteratively, removing the feature that contributes the least to the model’s performance. Unlike Forward Feature Selection (FFS), where the algorithm starts with no features and adds them one by one, here it begins with all features and gradually removes those that have the least impact on the target variable.

Let’s walk through an example (a sketch follows below). The algorithm first builds the model with all the features and calculates the MSE score, which is 24.2911 in this example run. It then begins a loop where each feature is removed one at a time and the MSE is recalculated: first F1 is removed and the MSE computed, then F1 is put back and F2 is removed, and so on until every feature has been left out once and its MSE recorded.

Once this iteration is complete, the algorithm identifies the feature whose removal resulted in the lowest MSE score. For instance, if removing F6 yielded an MSE of 21.56, the algorithm concludes that F6 had little impact on the target and drops it. It then removes one feature at a time again, comparing the lowest MSE at each step to the lowest score from the previous iteration. If the new lowest score is lower, the corresponding feature is removed as well; if the previous score remains lower, the algorithm stops and presents the final selected features. The sketch below shows a single round of this loop.
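
A minimal sketch of one elimination round, reusing the split and imports from the earlier snippets (the feature names and scores will differ from the Boston-based numbers quoted here):

```python
# One backward-elimination round: drop each feature in turn and re-score the model.
baseline = LinearRegression().fit(X_train, y_train)
baseline_mse = mean_squared_error(y_test, baseline.predict(X_test))

drop_scores = {}
for feature in X_train.columns:
    cols = [c for c in X_train.columns if c != feature]   # all features except this one
    model = LinearRegression().fit(X_train[cols], y_train)
    drop_scores[feature] = mean_squared_error(y_test, model.predict(X_test[cols]))

weakest = min(drop_scores, key=drop_scores.get)            # dropping this feature hurts least
print(baseline_mse, weakest, drop_scores[weakest])
```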

In the example run, the initial iteration found that removing the feature “Age” resulted in the lowest MSE of 21.651 compared to removing any other feature. This was the best improvement in model performance observed in the first iteration.

Afterward, the process began again, removing one feature at a time. During subsequent iterations, however, no removal of any single feature yielded an MSE lower than 21.651, indicating that removing further features did not improve the model’s performance beyond the point achieved by removing “Age.” The process therefore concluded, and the final selected features are simply the original list without “Age,” as it was determined to have the least impact on the model’s predictive power.
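
With MLXTEND, the same backward behaviour is just the SequentialFeatureSelector with forward=False; a sketch, again reusing the earlier split:

```python
# Backward elimination with MLXTEND: start from all features and remove the weakest.
from mlxtend.feature_selection import SequentialFeatureSelector as SFS
from sklearn.linear_model import LinearRegression

sbs = SFS(LinearRegression(),
          k_features='best',
          forward=False,                     # backward: start with every feature, drop the weakest
          floating=False,
          scoring='neg_mean_squared_error',
          cv=5)
sbs = sbs.fit(X_train, y_train)
print(sbs.k_feature_names_, sbs.k_score_)
```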

3. Recursive Feature Elimination (RFE): This method is akin to Backward Feature Elimination (BFE), where the goal is to identify the most relevant features from a dataset. The process begins with all features included, then iteratively removes the least important features. This iterative elimination is based on specific criteria, such as the coefficients in linear models or the feature importances in tree-based models.

Here’s how it works:

  • Initially, all features are considered.
  • The algorithm builds a model using these features and evaluates their importance.
  • The least important feature is then removed from the set.
  • The process repeats with the remaining features until the desired number of features is reached or a specific criterion is met.

This method essentially “prunes” the feature set by gradually eliminating the least valuable features, resulting in a subset of features that are deemed most important for the model’s predictive power.
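
scikit-learn ships this as the RFE class. A minimal sketch, reusing the earlier split; the target of 6 features is an arbitrary choice for illustration:

```python
# Recursive Feature Elimination with scikit-learn.
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression

rfe = RFE(estimator=LinearRegression(),
          n_features_to_select=6,   # assumed target size; pick what suits your problem
          step=1)                   # remove one feature per iteration
rfe = rfe.fit(X_train, y_train)
print(list(X_train.columns[rfe.support_]))   # the features RFE kept
print(rfe.ranking_)                          # 1 = selected; larger ranks were eliminated earlier
```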

4. Bidirectional Elimination (Sequential Forward Floating Selection): This method combines elements of both Forward Feature Selection (FFS) and Backward Feature Elimination (BFE). Here’s how it works:

  • Initially, the algorithm starts with an empty set of selected features.
  • It then performs Forward Feature Selection (FFS) by iteratively adding features one by one, evaluating their impact on the model’s performance.
  • After this forward selection process, it switches to Backward Feature Elimination (BFE), where it removes features one by one to check for any improvement in model performance.
  • This process of alternating between adding and removing features continues until the algorithm reaches the desired number of features or a certain performance criterion.

The idea behind Bidirectional Elimination is to combine the strengths of both forward and backward methods.

In the example run, the algorithm starts out with an empty set of features, builds the model, calculates the MSE score, and picks the feature with the lowest MSE, which here is F6. It then loops through the FFS step again and chooses the feature giving the next lowest MSE, which is F4. After selecting features like F6 and F4 through forward selection, the algorithm applies Backward Feature Elimination (BFE): it removes each selected feature in turn and recalculates the MSE. For example, when F6 is removed the resulting MSE is 24362, whereas with F6 kept it is 21263, indicating it is better to keep F6. The same check is repeated for each feature, such as F4, where removing it gives an MSE of 21263 but keeping it gives 13993. The algorithm continues this cycle, eliminating a feature only if removing it results in a lower MSE than keeping it.
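
With MLXTEND, bidirectional (floating) selection is the same SequentialFeatureSelector with floating=True; a sketch, reusing the earlier split:

```python
# Sequential Forward Floating Selection: forward selection with a backward
# "floating" check after each addition.
from mlxtend.feature_selection import SequentialFeatureSelector as SFS
from sklearn.linear_model import LinearRegression

sffs = SFS(LinearRegression(),
           k_features='best',
           forward=True,                     # add features one at a time...
           floating=True,                    # ...then try removing previously added ones
           scoring='neg_mean_squared_error',
           cv=5)
sffs = sffs.fit(X_train, y_train)
print(sffs.k_feature_names_, sffs.k_score_)
```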

Throughout this discussion of the methods, the focus has been on using Mean Squared Error (MSE) as the evaluation metric. However, when working with libraries like scikit-learn or MLXTEND for feature selection, there are often more options available for evaluation. These libraries allow you to specify different scoring parameters depending on your task. For example, with classifiers, you can use {accuracy, f1, precision, recall, roc_auc}, and for regressors, {‘mean_absolute_error’, ‘mean_squared_error’/’neg_mean_squared_error’, ‘median_absolute_error’, ‘r2’} are common options.

Additionally, you can try different models beyond linear regression. It’s about experimenting with various scoring metrics and models to find the best result for your specific problem; for further exploration, delve into the documentation of MLXTEND and scikit-learn to discover more options and techniques.
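
As an illustration, the sketch below swaps both the estimator and the scoring metric; RandomForestRegressor and 'r2' are arbitrary choices for the example, not a recommendation:

```python
# Same selector, different model and scoring metric.
from mlxtend.feature_selection import SequentialFeatureSelector as SFS
from sklearn.ensemble import RandomForestRegressor

sfs_r2 = SFS(RandomForestRegressor(n_estimators=100, random_state=42),
             k_features='best',
             forward=True,
             floating=False,
             scoring='r2',     # any scikit-learn scorer string works here
             cv=5)
sfs_r2 = sfs_r2.fit(X_train, y_train)
print(sfs_r2.k_feature_names_, sfs_r2.k_score_)
```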

Here are the pros and cons of using the wrapper method:
Pros:
- Accuracy: subsets are evaluated with the actual model, so selection reflects real predictive performance
- Feature interaction: features are judged in combination rather than in isolation
- Flexible and customizable: any estimator and scoring metric can be plugged in

Cons:
- Computationally expensive
- Prone to overfitting, since subsets are tuned to the evaluation data
- Longer training time

In conclusion, no method is set in stone, so it’s essential to try out different approaches to find what works best for your dataset and model.
