FEATURE SELECTION IN SUPERVISED LEARNING - PART 3

Chiedu Mokwunye
Mar 13, 2024

--

In our previous discussions, we covered the concept of feature selection, breaking it down into three main methods: Filter, Wrapper, and Embedded. We delved into the Filter method in “PART 1” and explored the Wrapper method in detail in “PART 2.” Now, in “PART 3,” we turn our attention to the Embedded method. Unlike the previous methods, the Embedded method involves machine learning models directly in the feature selection process. While we won’t delve into the intricacies of specific models, we’ll provide a brief overview of the concept and highlight some common models that fall under this category.

Embedded Method
The Embedded Method, also referred to as the “Intrinsic Method,” combines elements of both the Wrapper and Filter methods. Unlike those methods, where feature selection happens as a separate step outside of model training, the Embedded Method builds feature selection directly into the model’s training process. This means that as the model trains, the algorithm learns which features are most relevant for prediction, effectively selecting features while it builds the model.
Some algorithms that naturally perform feature selection during training include:

LASSO (Least Absolute Shrinkage and Selection Operator) / Linear Regression: LASSO is often used with Linear Regression to select the most important features by penalizing the size of the coefficients, which shrinks the coefficients of less important features the most. In a linear regression context, these coefficients represent the weights assigned to each feature in the dataset. For instance, in a simple linear regression model represented by y = mx + b, where y is the predicted outcome, m is the coefficient (slope), x is the feature, and b is the intercept, the coefficient m tells us how much that feature affects the predicted outcome.

For instance, when predicting house prices, the coefficients would indicate the influence of each feature on the predicted price. The LASSO method works by shrinking some coefficients all the way to zero, effectively removing those features from the model. This way, it keeps the most important features while discarding the less important ones. The coefficients can, in principle, be found by minimizing the LASSO objective shown below. It’s essential to note that there are various techniques for estimating coefficients, and LASSO provides a systematic way to select the most influential features for a given prediction task.
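For reference, the standard LASSO objective is the ordinary least-squares error plus an L1 penalty on the coefficients, where λ (lambda) controls how strongly the coefficients are shrunk toward zero:

```latex
\min_{\beta_0,\,\beta_1,\dots,\beta_p} \;
\sum_{i=1}^{n} \Big( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \Big)^2
\;+\; \lambda \sum_{j=1}^{p} \lvert \beta_j \rvert
```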

In practice, the coefficients in the formula are typically not manually computed. Instead, when training models such as Linear Regression with LASSO regularization, the model itself provides methods to obtain these coefficients for each feature. These coefficients can be positive or negative.

  • Positive Coefficients: A positive coefficient signifies a direct relationship with the dependent variable. In simpler terms, as the feature’s value increases, the predicted outcome also increases. This indicates that the feature pushes the prediction upward.
  • Negative Coefficients: Conversely, a negative coefficient indicates an inverse relationship with the dependent variable. As the feature’s value increases, the predicted outcome decreases. This shows that the feature pushes the prediction downward.

Features with non-zero coefficients are considered important by the LASSO model and are retained, as they have a significant impact on the prediction. Features that contribute little, on the other hand, have their coefficients shrunk by the LASSO penalty, reducing their impact and, with a strong enough penalty, setting them exactly to zero. These features are effectively removed from the model, as they are considered less important for making accurate predictions.
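To make this concrete, here is a minimal sketch using scikit-learn’s Lasso on synthetic data. The feature names, the alpha value, and the data itself are purely illustrative assumptions, not a real housing dataset.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

# Illustrative synthetic "house price" data: 3 informative features, 2 pure noise.
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 5))
feature_names = ["sqft", "bedrooms", "age", "noise_1", "noise_2"]
y = 50 * X[:, 0] + 20 * X[:, 1] - 10 * X[:, 2] + rng.normal(scale=5, size=200)

# Standardize features so the L1 penalty treats them on a comparable scale.
X_scaled = StandardScaler().fit_transform(X)

# alpha plays the role of lambda in the objective above: larger alpha, more shrinkage.
lasso = Lasso(alpha=1.0)
lasso.fit(X_scaled, y)

# Non-zero coefficients are the features LASSO kept; zeros are the ones it dropped.
for name, coef in zip(feature_names, lasso.coef_):
    print(f"{name:>10}: {coef: .3f}")
```

With a penalty this strong, the noise features typically come out with coefficients of exactly zero, while the informative features keep non-zero positive or negative weights, matching the interpretation described above.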

Random Forest: This is an ensemble learning method, which means it combines the results of multiple decision trees during the model training process. Feature selection in Random Forest is inherent to how these trees are constructed.

In this method, Random Forest builds many individual decision trees, each trained on a random sample of the data, and at each split it considers only a random subset of the features. For example, a given split might only get to choose between features like “bedrooms” and “income.” The algorithm then selects the best feature within that subset to split the data at that node of the tree.

After training, the importance of each feature can be assessed. One common approach, known as permutation importance, measures how much the accuracy of the model decreases when a feature’s values are randomly shuffled. Features that cause a significant drop in model accuracy when their values are shuffled are considered important by the algorithm.

Random Forest implementations also provide a built-in feature importance measure, which allows you to obtain a ranking of the features along with their respective importance scores. This helps in understanding which features have the most influence on the model’s predictions.
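As a rough sketch of both ideas with scikit-learn (again on illustrative synthetic data with hypothetical feature names): the built-in feature_importances_ attribute gives impurity-based scores, while permutation_importance implements the shuffling approach described above.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance

# Illustrative synthetic data: only the first three features actually matter.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 5))
feature_names = ["sqft", "bedrooms", "age", "noise_1", "noise_2"]
y = 50 * X[:, 0] + 20 * X[:, 1] - 10 * X[:, 2] + rng.normal(scale=5, size=300)

forest = RandomForestRegressor(n_estimators=200, random_state=0)
forest.fit(X, y)

# Built-in ranking: how much each feature reduces impurity, averaged over all trees.
for name, score in zip(feature_names, forest.feature_importances_):
    print(f"built-in  {name:>10}: {score:.3f}")

# Permutation importance: how much the score drops when a feature's values are shuffled.
perm = permutation_importance(forest, X, y, n_repeats=10, random_state=0)
for name, score in zip(feature_names, perm.importances_mean):
    print(f"shuffled  {name:>10}: {score:.3f}")
```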

Another popular ensemble method is Gradient Boosting, which works by building trees sequentially, with each subsequent tree correcting the errors of its predecessor.

Here’s how it typically works:

  • Initially, a tree is built and the residual error is calculated. This error represents the difference between the actual values and the predicted values from the initial model.
  • Next, a new tree is constructed to predict these residuals from the initial model.
  • The predictions from this new tree are then added to the predictions of the initial model, creating a combined prediction that is more accurate than the initial model alone.
  • This process continues iteratively, with each new tree correcting the errors of the combined model.
  • Eventually, the process is completed and the model’s feature importance can be determined. Each feature is assigned an importance score ranging from 0 to 1, indicating how much it contributes to the model’s predictive performance.
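Here is a minimal sketch of that process with scikit-learn’s GradientBoostingRegressor, once more on illustrative synthetic data; the residual-fitting happens internally as the trees are built one after another, and the resulting importance scores sum to 1.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

# Illustrative synthetic data: only the first three features influence the target.
rng = np.random.default_rng(1)
X = rng.normal(size=(300, 5))
feature_names = ["sqft", "bedrooms", "age", "noise_1", "noise_2"]
y = 50 * X[:, 0] + 20 * X[:, 1] - 10 * X[:, 2] + rng.normal(scale=5, size=300)

# Each new tree is fit to the residual errors of the ensemble built so far.
gbr = GradientBoostingRegressor(n_estimators=100, learning_rate=0.1, random_state=0)
gbr.fit(X, y)

# Importance scores range from 0 to 1 and sum to 1 across all features.
for name, score in zip(feature_names, gbr.feature_importances_):
    print(f"{name:>10}: {score:.3f}")
```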

Beyond these examples, there are various other models in the realm of embedded methods that use the model itself for feature selection. It’s important to note that not all machine learning models fall under the embedded method, but those that do can offer powerful feature selection capabilities.

This brings us to the end of the topic of feature selection.
