Getting started with machine learning is no easy task—especially if you're unfamiliar with the mathematical foundations behind the algorithms. But what if I told you that you don't need to be a math expert to create your very own classification model? With scikit-learn, a powerful and beginner-friendly Python library, you can dive into machine learning and build practical models in no time.
In this blog, I’ll walk you through the process of creating a classification model using scikit-learn. For this demonstration, we’ll work with a diabetes prediction dataset. While this experiment won’t produce a clinically reliable model (for that, you’d need more advanced research), it will help you grasp the fundamentals of supervised learning and how to apply them using Python.
This tutorial is designed to introduce you to the basics of supervised learning and show you how to implement it step-by-step using Python and scikit-learn. In this demonstration, we’ll focus on classification. Specifically, we’ll build a binary classification model to predict whether a person has diabetes.
About the Dataset
The Diabetes prediction dataset is a collection of medical and demographic data from patients, along with their diabetes status (positive or negative). The dataset includes the following features:
- Age : The patient’s age.
- Gender : The patient’s gender.
- BMI : Body Mass Index, a measure of body fat based on height and weight.
- Hypertension : Whether the patient has hypertension (yes/no).
- Heart Disease : Whether the patient has heart disease (yes/no).
- Smoking History : The patient’s smoking habits.
- HbA1c Level : A measure of average blood sugar levels over the past 2-3 months.
- Blood Glucose Level : The patient’s current blood glucose level.
The target has two possible outcomes: 0, indicating no diabetes, and 1, indicating the presence of diabetes. Because there are exactly two classes, this is known as binary classification.
Preparing the Dataset
Before building the model, we need to prepare the dataset. This involves:
1. Loading the Data: Importing the dataset into our Python environment. Click the link here and download the dataset. You can also download it via kagglehub or the Kaggle CLI, as sketched below. Make sure the dataset file is in the same directory as your Jupyter notebook.
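If you prefer the programmatic route, here is a minimal kagglehub sketch. The dataset slug below is an assumption based on the dataset’s Kaggle page, so verify it against the page you downloaded from:
import kagglehub

# Download the dataset from Kaggle; the slug is an assumption -- check it
# against the dataset's Kaggle page before running.
path = kagglehub.dataset_download("iammustafatz/diabetes-prediction-dataset")
print("Dataset downloaded to:", path)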
2. Exploring the Data: Understanding the structure and distribution of the data.
Before using supervised learning, make sure that:
- There are no missing values
- The data is in numeric format
- The data is stored in a pandas DataFrame or NumPy array
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix

df = pd.read_csv("./diabetes_prediction_dataset.csv")
df.head()
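To verify the checklist above, it’s worth a quick look at missing values, the class balance, and the numeric ranges. A short exploration sketch using standard pandas calls:
# Count missing values per column
print(df.isnull().sum())

# Check how balanced the target classes are
print(df["diabetes"].value_counts(normalize=True))

# Summary statistics for the numeric columns
print(df.describe())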
3. Preprocessing the Data: Handling missing values, encoding categorical variables, and scaling numerical features. In this dataset, two features have the object data type.
df.info()
To fix this, we’ll use scikit-learn’s LabelEncoder to convert these columns to integers.
# Use a separate encoder per column so each mapping stays inspectable
gender_le = LabelEncoder()
df["gender"] = gender_le.fit_transform(df["gender"])

smoking_le = LabelEncoder()
df["smoking_history"] = smoking_le.fit_transform(df["smoking_history"])

df.info()
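A caveat worth knowing: LabelEncoder maps categories to arbitrary integers, and a distance-based model like KNN will treat that ordering as meaningful. One-hot encoding is a common alternative for a multi-category feature like smoking_history. A minimal sketch, applied to the raw data before any label encoding:
# Alternative preprocessing: one-hot encode the categorical columns so no
# artificial ordering is introduced (one 0/1 column per category).
df_raw = pd.read_csv("./diabetes_prediction_dataset.csv")
df_onehot = pd.get_dummies(df_raw, columns=["gender", "smoking_history"])
df_onehot.info()
We’ll stick with the label-encoded df for the rest of this tutorial.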
Building the Model
Now that the data is ready, we’ll train a K-Nearest Neighbors (KNN) classifier. KNN is a simple yet effective algorithm for classification tasks. It works by finding the K closest data points (neighbors) in the training set and making predictions based on their labels.
Here’s what we’ll do:
- Split the data into training and testing sets.
- Train the KNN model on the training data.
- Tune the hyperparameters (e.g., choosing the right value of K).
We’ll implement this step-by-step using Python and scikit-learn.
X = df.drop(columns="diabetes").values
y = df["diabetes"].values
print(X.shape, y.shape)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=21, stratify=y)

# Instantiate the classifier before fitting (scikit-learn defaults to k=5)
knn = KNeighborsClassifier()
knn.fit(X_train, y_train)
print(knn.score(X_test, y_test))
train_accuracies = {}
test_accuracies = {}
neighbors = np.arange(1, 26)
for neighbor in neighbors:
    knn = KNeighborsClassifier(n_neighbors=neighbor)
    knn.fit(X_train, y_train)
    train_accuracies[neighbor] = knn.score(X_train, y_train)
    test_accuracies[neighbor] = knn.score(X_test, y_test)
    print(f"k = {neighbor}: Train Accuracy = {train_accuracies[neighbor]:.4f}, Test Accuracy = {test_accuracies[neighbor]:.4f}")
Once you've run it all, you'll see the train and test accuracy printed for each value of k. Plotting both curves makes the trend easier to read; here’s a minimal sketch using the matplotlib import from earlier:
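# Plot train vs. test accuracy for each value of k
plt.figure(figsize=(8, 5))
plt.plot(neighbors, list(train_accuracies.values()), label="Train Accuracy")
plt.plot(neighbors, list(test_accuracies.values()), label="Test Accuracy")
plt.xlabel("Number of Neighbors (k)")
plt.ylabel("Accuracy")
plt.title("KNN: Varying Number of Neighbors")
plt.legend()
plt.show()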
Peak test accuracy seems to occur around 13 neighbors. Let's try that value out and evaluate the model.
Evaluating the Model
To assess the model’s performance, we’ll use metrics such as:
- Accuracy : The percentage of correct predictions.
- Precision, Recall, and F1-Score : Measures to evaluate the model’s ability to correctly classify positive cases (diabetic patients).
- Confusion Matrix : A breakdown of true positives, true negatives, false positives, and false negatives.
These metrics will give us a comprehensive understanding of how well our model performs.
# Refit with the chosen value of k; without this, knn would still be the
# last model from the loop above (k = 25)
knn = KNeighborsClassifier(n_neighbors=13)
knn.fit(X_train, y_train)

y_pred = knn.predict(X_test)

accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.4f}")

precision = precision_score(y_test, y_pred)
print(f"Precision: {precision:.4f}")

recall = recall_score(y_test, y_pred)
print(f"Recall: {recall:.4f}")

f1 = f1_score(y_test, y_pred)
print(f"F1-Score: {f1:.4f}")

conf_matrix = confusion_matrix(y_test, y_pred)
print("Confusion Matrix:")
print(conf_matrix)
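As a side note, scikit-learn’s classification_report bundles precision, recall, and F1 for each class into a single call, which is handy for a quick overview:
from sklearn.metrics import classification_report

# Per-class precision, recall, and F1 in one table
print(classification_report(y_test, y_pred))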
As you can see, accuracy is high, meaning most predictions are correct. Precision is also good: when the model predicts a positive case, it is usually right. However, recall is low, meaning many actual positives are misclassified as negative. Recall needs improvement.
We can lower the number of neighbors to 11 and then compare the results; a quick re-run sketch follows.
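# Re-run with k = 11 and recompute the same metrics for comparison
knn_11 = KNeighborsClassifier(n_neighbors=11)
knn_11.fit(X_train, y_train)
y_pred_11 = knn_11.predict(X_test)

print(f"Accuracy: {accuracy_score(y_test, y_pred_11):.4f}")
print(f"Precision: {precision_score(y_test, y_pred_11):.4f}")
print(f"Recall: {recall_score(y_test, y_pred_11):.4f}")
print(f"F1-Score: {f1_score(y_test, y_pred_11):.4f}")
print(confusion_matrix(y_test, y_pred_11))
Comparing the two runs: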
- Accuracy dropped (98.39% → 95.18%): The model made slightly more mistakes overall.
- Precision slightly decreased (95.4% → 92.86%): More false positives (FP increased from 55 → 92).
- Recall improved (44.6% → 46.94%): Fewer false negatives (FN decreased from 1,412 → 1,353).
- F1-Score increased (60.7% → 62.36%): A better balance between precision and recall.
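One likely culprit for the modest recall is feature scale: KNN is distance-based, and large-valued features like blood glucose can dominate the distance calculation over small-valued ones like BMI. The preprocessing step above mentioned scaling, but we never applied it. Here is a minimal sketch using a StandardScaler inside a Pipeline; the choice of scaler and k are illustrative, not tuned:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Scale features before KNN so every feature contributes comparably
# to the distance computation
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("knn", KNeighborsClassifier(n_neighbors=11)),
])
pipe.fit(X_train, y_train)
print(pipe.score(X_test, y_test))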
Conclusion
In this tutorial, we explored the basics of supervised learning and built a binary classification model to predict diabetes using the K-Nearest Neighbors algorithm. While this model is far from perfect, it serves as an excellent starting point for anyone new to machine learning.
By following this guide, you’ve taken your first step into the world of machine learning. From here, you can experiment with other algorithms, optimize hyperparameters, and explore more complex datasets.