Deep Dive into Supervised Learning

Chiedu Mokwunye
4 min read · Jan 27, 2024


In the previous discussion, we touched upon the concept that supervised learning involves working with labeled data, where the machine is directed to predict or classify specific outcomes based on the information provided. For instance, you might instruct the machine to distinguish between male and female based on the input data or predict a salary using the features present in the dataset. Now, let’s delve into a practical example by exploring a dataset and implementing the code to provide a more tangible understanding of how these concepts operate.

We will utilize a dataset sourced from Kaggle. A dataset is simply a structured collection of data, and datasets can come from various places: Excel sheets, Google Sheets, databases, or even web scraping. For this demonstration, we'll be working with several Python libraries, including scikit-learn, pandas, seaborn, and matplotlib, inside a Jupyter Notebook.

Jupyter Notebook serves as an interactive editor that lets us execute Python code, perform data analysis, and construct machine learning models. If you'd like to install it, you can download the Anaconda distribution from the following link: https://www.anaconda.com/download.

Our initial step involves reading and displaying the dataset. The code presented here is designed for execution within a Jupyter Notebook environment. As we progress through this course, the purpose and functionality of each library will become clearer.

As mentioned earlier, datasets can come in various formats, with CSV being a common choice. In this instance, the dataset is named “Salary Data,” and we use the pandas library (abbreviated as “pd”) to read the CSV and explore the dataset’s structure.

The dataset comprises 6 columns and 375 rows. Age, Gender, Education Level, Job Title, and Years of Experience are the features, and Salary is the label.
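The loading step can be sketched as follows. The file name and the sample rows below are assumptions for illustration; the real Kaggle file has 375 rows with these six columns:

```python
import pandas as pd

# If the Kaggle file is saved locally, the real notebook would read it with:
# df = pd.read_csv("Salary Data.csv")

# A small illustrative frame with the same columns, since the file
# itself is not bundled here:
df = pd.DataFrame({
    "Age": [32, 28, 45],
    "Gender": ["Male", "Female", "Male"],
    "Education Level": ["Bachelor's", "Master's", "PhD"],
    "Job Title": ["Software Engineer", "Data Analyst", "Manager"],
    "Years of Experience": [5.0, 3.0, 15.0],
    "Salary": [90000.0, 65000.0, 150000.0],
})

print(df.shape)   # (number of rows, number of columns)
print(df.head())  # preview the first rows
```

`df.shape` and `df.head()` are the quickest way to confirm the dataset's structure after loading.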

Data Cleaning - It’s essential to clean the dataset, addressing issues such as missing values, as using unclean data can impact model performance. Assuming we have a clean dataset, we can proceed with building the model.
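A minimal sketch of the missing-value check, using a small assumed frame with one gap:

```python
import numpy as np
import pandas as pd

# Illustrative frame with one missing Salary value (assumed data).
df = pd.DataFrame({
    "Age": [32, 28, 45],
    "Salary": [90000.0, np.nan, 150000.0],
})

print(df.isnull().sum())  # count of missing values per column
df = df.dropna()          # drop any row containing a missing value
print(df.shape)
```

On the real dataset you would inspect the counts first and decide between dropping rows and imputing values; `dropna()` is the simplest option when few rows are affected.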

Label Encoding - Examining the dataset, we observe a combination of numerical and categorical features. Machine learning models, however, cannot learn directly from categorical values. To address this, we employ the scikit-learn label encoder, a tool that converts categorical values into numerical representations. Applying it to each categorical feature yields a transformed dataset in which every column is numeric, a crucial preprocessing step that ensures the model can interpret and learn from all features. From the table below we can see that Education Level now has numerical values.
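The encoding step can be sketched like this; the column names match the dataset, while the sample values are assumptions:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Illustrative categorical columns (assumed values).
df = pd.DataFrame({
    "Education Level": ["Bachelor's", "Master's", "PhD", "Bachelor's"],
    "Gender": ["Male", "Female", "Male", "Female"],
})

# Replace each categorical column with integer codes.
# LabelEncoder assigns codes in alphabetical order of the unique values.
for col in ["Education Level", "Gender"]:
    df[col] = LabelEncoder().fit_transform(df[col])

print(df)
```

Note that the real notebook would also encode Job Title the same way; any column of strings gets its own encoder.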

Feature Selection - We designate two variables, X and Y, for the model: X represents the features, and Y represents the label. To train the model effectively and enable it to generalize to unseen data, we split the dataset into 70% training data and 30% testing data. Although there's no strict rule for the splitting ratio, providing the model with sufficient training data is essential for effective learning.
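A minimal sketch of the split, assuming the frame has already been encoded to all-numeric columns (the ten sample rows are fabricated for illustration):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Assumed, already-numeric stand-in for the encoded Salary Data frame.
df = pd.DataFrame({
    "Age": range(30, 40),
    "Years of Experience": range(1, 11),
    "Salary": range(50000, 60000, 1000),
})

X = df.drop("Salary", axis=1)  # features
Y = df["Salary"]               # label

# 70% train / 30% test; random_state fixes the shuffle for reproducibility.
X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.3, random_state=42)

print(len(X_train), len(X_test))  # 7 training rows, 3 testing rows
```

On the full 375-row dataset the same call would yield roughly 262 training rows and 113 testing rows.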

Model Selection - In this demonstration, we opt for the Linear Regression model, a common choice in supervised learning for regression problems, where the goal is to predict a continuous label such as a salary. In supervised learning, the label can be either continuous (regression) or categorical (classification, e.g. yes/no).

To implement the chosen Linear Regression model from scikit-learn, we need to import the model in our Jupyter Notebook and initialize it.

Following initialization, the next step involves prediction. This is accomplished by fitting the model with the training data and subsequently predicting with the testing data. After making predictions, we can assess the model's score (for linear regression, scikit-learn reports the R² value), which, in this instance, stands at 92%. This is a commendable result.
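The fit-predict-score sequence can be sketched as below. Since the real dataset isn't bundled here, the features and salaries are synthetic stand-ins generated for illustration:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the encoded Salary Data frame (assumption).
rng = np.random.default_rng(0)
X = rng.uniform(0, 20, size=(100, 2))                   # age-like, experience-like
Y = 40000 + 3000 * X[:, 1] + rng.normal(0, 2000, 100)   # salary-like label

X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.3, random_state=42)

model = LinearRegression()           # initialize the model
model.fit(X_train, Y_train)          # learn coefficients from the training split
predictions = model.predict(X_test)  # predict salaries for unseen rows

# .score() returns R² for regression, often quoted informally as "accuracy".
print(round(model.score(X_test, Y_test), 2))
```

Because the synthetic label is nearly linear in the features, the score comes out high here too; on the real dataset the same three calls produce the 92% figure quoted above.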

In our forthcoming post, we will delve into Feature Selection, explaining how it enhances our understanding and helps our supervised model learn more effectively. In the meantime, you can play around with this supervised learning model, built on the same dataset, at https://salaryprediction-chiedu.streamlit.app/.
