FEATURE SELECTION IN SUPERVISED LEARNING - PART 1

Chiedu Mokwunye
6 min read · Feb 11, 2024


Feature selection is the process of choosing the most relevant features from a dataset to be used in training a model. When collecting data, there may be numerous features, some of which may not contribute significantly to the model’s predictive power. For instance, in a dataset related to a mechanic workshop’s inventory, details like the previous and current owner may be irrelevant to predicting the resale value of car parts.

The primary purpose of feature selection is to enhance the speed and computational efficiency of the model. By excluding unnecessary features, the model becomes more focused, which can improve accuracy. Additionally, a well-selected set of features is less prone to overfitting, where the model becomes too tailored to the training data, and it minimizes the impact of noise (irrelevant information that may exist in the dataset).

In summary, feature selection streamlines the input variables to those that truly matter, resulting in a more efficient and accurate machine learning model.

If all that sounds too technical, consider feature selection as akin to playing a strategic game like “charades.” In this game, your goal is to communicate a word or phrase to your team using gestures and clues. Now, imagine you’ve drawn a card with the word “car” on it. To convey this word effectively, you wouldn’t start by describing the type of road it’s driving on; instead, you would focus on the most crucial details, such as mimicking engine sounds, steering motions, honking, and braking actions. That focus on the most telling details is exactly what we are doing with feature selection.

How do we apply feature selection? Feature selection methods fall into three categories:

  • Filter method
  • Wrapper method
  • Embedded method

Filter Method — The filter method involves selecting the most important features using statistical calculations. Within this method, we can apply various filter techniques, including but not limited to the following:

  • Pearson Correlation coefficient — This method measures the correlation between each feature and the target (label). The correlation coefficient varies between -1 and 1. A value closer to 1 indicates a strong positive correlation, observed when the trend from x to y slopes upward, while a value closer to -1 indicates a strong negative correlation, observed when the trend slopes downward.

Photo credit: https://www.ncl.ac.uk/webtemplate/ask-assets/external/maths-resources/images/R_value.png

A heatmap is often used to visualize the correlation matrix.

If we read the “Salary” row of the heatmap and look along the x-axis, we can see the correlation of each feature with the salary label. From the heatmap we can see that “gender” has the weakest correlation with salary. The correlation matrix is based on this mathematical formula:

r = (NΣxy − (Σx)(Σy)) / √[(NΣx² − (Σx)²)(NΣy² − (Σy)²)]

Photo credit: https://study.com/academy/lesson/the-correlation-coefficient-definition-formula-example.html

Σ represents summation and N represents the total number of rows in the dataset. X represents the current feature and Y represents the feature it is paired with. For example, if we choose “Age” as x, we calculate the correlation of “Age” with each of the other features: in the first instance y would be “Category”, then y would move on to “Salary”, and so on until every feature has been paired. This process is repeated for each pair of features, resulting in the correlation values displayed in the heatmap.

  • Σx = sum of the values in the x column
  • Σy = sum of the values in the y column
  • Σx² = sum of the squared values in the x column
  • Σy² = sum of the squared values in the y column
  • Σxy = sum of the products x*y over all rows
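
Putting this together, here is a minimal sketch in Python using pandas and seaborn. The DataFrame and its column values are hypothetical stand-ins for the kind of “Age”/“Category”/“Gender”/“Salary” data discussed above.

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Hypothetical dataset; "Category" and "Gender" are assumed to be label-encoded already.
df = pd.DataFrame({
    "Age":      [23, 35, 45, 52, 29, 41],
    "Category": [1, 2, 2, 3, 1, 3],
    "Gender":   [0, 1, 0, 1, 1, 0],
    "Salary":   [30000, 52000, 61000, 80000, 38000, 67000],
})

# Pairwise Pearson correlation between every pair of columns.
corr_matrix = df.corr(method="pearson")

# Read the "Salary" row (or column) to see how strongly each feature
# correlates with the label, then visualize the full matrix as a heatmap.
print(corr_matrix["Salary"])
sns.heatmap(corr_matrix, annot=True, cmap="coolwarm", vmin=-1, vmax=1)
plt.show()
```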

2. Variance threshold — This method aims to eliminate features with low variance, operating under the assumption that features displaying minimal variation offer less informative content. The goal is to simplify the model, enhancing its interpretability. The variance of a feature is determined by the formula:

Var(x) = (1/n) Σ (xᵢ − x̄)²

Where:

  • n is the number of data points in the feature.
  • xᵢ represents each individual data point in the feature.
  • x̄ is the mean (average) of the feature values.

In practical terms, consider a dataset and take the feature “Age” as an example. The variance score for “Age” is computed by applying the variance formula, which measures how much the values of “Age” deviate from their mean. The sketch below shows both a manual calculation and the equivalent result obtained with scikit-learn.
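
A minimal sketch of that comparison, assuming a small hypothetical “Age” column:

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold

# Hypothetical "Age" values used to illustrate the variance formula above.
age = np.array([25.0, 32.0, 47.0, 51.0, 38.0])

# Manual calculation: the mean, then the average of the squared deviations.
mean_age = age.sum() / len(age)
manual_variance = ((age - mean_age) ** 2).sum() / len(age)

# scikit-learn computes the same (population) variance for each feature.
selector = VarianceThreshold(threshold=0.0)   # default threshold removes zero-variance (constant) features
selector.fit(age.reshape(-1, 1))              # expects a 2-D array of shape (rows, features)
sklearn_variance = selector.variances_[0]

print(manual_variance, sklearn_variance)      # the two values match
```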

3. SelectKBest — This method is used to select the top k features based on univariate statistical tests. Here, “k” represents the number of features to be selected. For instance, if k=5, the method will choose the top 5 features. The formula behind SelectKBest varies depending on the scoring function, which is a parameter of this method. The scoring function is chosen based on whether the problem is a classification or regression one. Some available scoring functions include ANOVA, Mutual Information, Chi-squared, and F-regression. For more, check out scikit-learn’s documentation: https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.SelectKBest.html.
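
As a quick illustration, here is a minimal sketch using SelectKBest with the F-regression scoring function on synthetic data; the dataset and the choice of k are placeholders, not the article’s actual data.

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectKBest, f_regression

# Synthetic regression data standing in for a "Salary"-style dataset:
# 100 rows and 8 candidate features, only 3 of which are informative.
X, y = make_regression(n_samples=100, n_features=8, n_informative=3, random_state=0)

# Keep the top k=5 features ranked by the F-regression score.
selector = SelectKBest(score_func=f_regression, k=5)
X_selected = selector.fit_transform(X, y)

print(X_selected.shape)        # (100, 5): only the 5 highest-scoring features remain
print(selector.get_support())  # boolean mask showing which columns were kept
```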

For example, in a regression problem like the one we’ve been discussing with the “Salary” dataset, we might choose the F-regression scoring function. However, let’s focus on the Chi-squared scoring function for now. The Chi-squared calculation is based on the observed and expected outcomes.

The Chi-squared statistic is calculated with this formula:

χ² = Σ (O − E)² / E

where O is the observed outcome and E is the expected outcome.

To illustrate, consider a dataset categorizing individuals by their Education Level and Gender. We first count the observed frequency of each combination, such as the number of Bachelor degree holders who are male or female. Then, we compute the expected frequency for each combination using a formula based on row and column totals. In this example, under the “Education Level” column “Bachelor” appears 5 times, and of those 5 occurrences 3 are males and 2 are females. This is what we observe.

The expected frequency calculation is different from the observed one; its formula is (Ri * Cj) / N

Where:

  • Ri is the total count in row i
  • Cj is the total count in column j
  • N is the total count of observations in the entire table

For example, Bachelors Female = ((2+3) * (2+2+0)) / 10 = 2, so “2” would be the expected frequency for that cell. The Master’s and PhD cells are calculated in the same way.

After obtaining both observed and expected frequencies, we compute the Chi-squared value by comparing them for each combination. This involves subtracting the expected frequency from the observed frequency, squaring the result, dividing by the expected frequency, and summing these values across all combinations.
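
Here is a minimal sketch of that calculation with NumPy. The Bachelor row (3 male, 2 female), the female column, and the overall total of 10 match the walkthrough above; the male counts for Master’s and PhD are assumed for illustration.

```python
import numpy as np

# Hypothetical contingency table (rows: Bachelor, Master, PhD; columns: Male, Female).
observed = np.array([
    [3, 2],   # Bachelor
    [2, 2],   # Master (male count assumed)
    [1, 0],   # PhD (male count assumed)
])

row_totals = observed.sum(axis=1, keepdims=True)   # Ri
col_totals = observed.sum(axis=0, keepdims=True)   # Cj
n = observed.sum()                                 # N

expected = row_totals @ col_totals / n             # E = (Ri * Cj) / N for every cell
chi_squared = ((observed - expected) ** 2 / expected).sum()

print(expected[0, 1])   # expected frequency for Bachelor-Female: 2.0
print(chi_squared)      # the Chi-squared statistic for the whole table
```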

While understanding these formulas can be insightful, in practice, we often rely on libraries like scikit-learn to handle feature selection, abstracting away the intricate details of the underlying statistical calculations.

To be continued….
