Project Overview

This project analyzes student performance in relation to factors such as gender, ethnicity, parental education, lunch type, and test preparation course. A machine learning model then predicts a student's math score from these factors together with the reading and writing scores.

Problem Statement

This project seeks to understand how student performance in math, reading, and writing is affected by demographic and contextual factors such as gender, ethnicity, parental education, lunch type, and test preparation. The goal is to identify patterns and relationships that influence academic achievement.

Technologies Used

Programming Language: Python
Libraries: Pandas, NumPy, Matplotlib, Seaborn, Scikit-learn
Framework: Flask (for deployment)

Data Collection

Dataset source: https://www.kaggle.com/datasets/spscientist/students-performance-in-exams/code
The dataset consists of 8 columns and 1,000 rows.
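
A quick way to confirm the shape after downloading the CSV is sketched below. The helper name `load_students` and the header-renaming step are assumptions: the rest of this README uses underscore-style names such as `math_score`, so the sketch normalizes headers to that form. A two-row in-memory sample with made-up values stands in for the real file.

```python
import io
import pandas as pd

def load_students(source):
    """Read the CSV and normalize headers like 'math score' -> 'math_score'."""
    df = pd.read_csv(source)
    df.columns = (
        df.columns.str.strip()
        .str.lower()
        .str.replace(r"[ /]+", "_", regex=True)
    )
    return df

# Two-row sample in place of the real file (hypothetical values).
sample_csv = io.StringIO(
    "gender,race/ethnicity,parental level of education,lunch,"
    "test preparation course,math score,reading score,writing score\n"
    "female,group B,bachelor's degree,standard,none,72,72,74\n"
    "male,group C,some college,free/reduced,completed,69,90,88\n"
)
df = load_students(sample_csv)
print(df.shape)          # (rows, columns)
print(list(df.columns))
```

With the real file, passing its path to `load_students` should report 1,000 rows and 8 columns if the download matches the description above.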

Data Description

gender: The gender of the student (e.g., male, female).
race_ethnicity: The group classification of the student (e.g., group A, group B, etc.).
parental_level_of_education: The highest level of education attained by the student's parent(s) (e.g., bachelor's degree, some college).
lunch: The type of lunch the student receives (e.g., standard, free/reduced).
test_preparation_course: Whether the student completed a test preparation course (e.g., none, completed).
math_score: The student's score in mathematics.
reading_score: The student's score in reading.
writing_score: The student's score in writing.

Insight

Consistent Performance: The average scores for math, reading, and writing are relatively close, suggesting a balanced curriculum or consistent evaluation criteria.
High Standard Deviation: The standard deviations for all three subjects indicate some variability in performance.
Outliers: Scores such as 0 in math and 10 in writing may need investigation to determine whether they are genuine outliers or represent missing/incomplete data.
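
One possible way to check the suspicious low scores is Tukey's IQR rule, sketched below. The helper `low_outliers`, the multiplier 1.5, and the sample scores are assumptions for illustration and are not taken from the project's notebook.

```python
import pandas as pd

def low_outliers(scores: pd.Series, k: float = 1.5) -> pd.Series:
    """Return scores below Q1 - k * IQR (Tukey's rule)."""
    q1, q3 = scores.quantile(0.25), scores.quantile(0.75)
    lower_fence = q1 - k * (q3 - q1)
    return scores[scores < lower_fence]

# Hypothetical scores; the 0 mimics the suspicious math score noted above.
math_score = pd.Series([0, 55, 60, 62, 66, 70, 71, 75, 80, 88])
print(math_score.describe())   # mean, std, quartiles in one view
flagged = low_outliers(math_score)
print(flagged.tolist())
```

A flagged value is only a candidate for review; whether a 0 is a real result or a data-entry gap still has to be decided from the source records.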

Exploring Data

Counting how many students scored full marks in each subject

# Count students with a perfect score in each subject
math_full = (df['math_score'] == 100).sum()
writing_full = (df['writing_score'] == 100).sum()
reading_full = (df['reading_score'] == 100).sum()

print(f'Number of students with full marks in Maths: {math_full}')
print(f'Number of students with full marks in Writing: {writing_full}')
print(f'Number of students with full marks in Reading: {reading_full}')

# Count students who scored 20 or below in each subject
math_less_20 = (df['math_score'] <= 20).sum()
writing_less_20 = (df['writing_score'] <= 20).sum()
reading_less_20 = (df['reading_score'] <= 20).sum()

print(f'Number of students with 20 marks or fewer in Maths: {math_less_20}')
print(f'Number of students with 20 marks or fewer in Writing: {writing_less_20}')
print(f'Number of students with 20 marks or fewer in Reading: {reading_less_20}')

Insights

From the values above, students performed worst in Maths.
The best performance is in the Reading section.

Visualizing the average score distribution to draw conclusions
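
The `average_score` column is not one of the 8 raw columns, so it has to be derived first. The sketch below assumes it is the mean of the three subject scores (the name `total_score` and the 10-point bands are also assumptions) and counts students per band, which is the information a histogram of the distribution would draw. The four rows are made-up sample values.

```python
import pandas as pd

df = pd.DataFrame({
    "math_score": [72, 69, 90, 47],
    "reading_score": [72, 90, 95, 57],
    "writing_score": [74, 88, 93, 44],
})
score_cols = ["math_score", "reading_score", "writing_score"]
df["total_score"] = df[score_cols].sum(axis=1)
df["average_score"] = df[score_cols].mean(axis=1)

# Count students per 10-point band; these counts are what a histogram would draw.
bins = list(range(0, 101, 10))
bands = pd.cut(df["average_score"], bins=bins, include_lowest=True)
band_counts = bands.value_counts().sort_index()
print(df[["total_score", "average_score"]])
print(band_counts[band_counts > 0])
```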

Insights

The numbers of male and female students are almost equal.
Group C has the largest number of students.
More students have the standard lunch than the free/reduced lunch.
More students have not enrolled in a test preparation course than have completed one.
"Some college" is the most common parental education level, followed closely by "associate's degree".
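
Counts like these can be obtained with `value_counts`. The sketch below is a minimal illustration on a made-up five-row frame, not the project's plotting code; the shares it prints are for the sample only.

```python
import pandas as pd

df = pd.DataFrame({
    "gender": ["female", "male", "female", "male", "female"],
    "lunch": ["standard", "standard", "free/reduced", "standard", "standard"],
    "test_preparation_course": ["none", "completed", "none", "none", "completed"],
})

# Share of students in each category, one column at a time.
for col in df.columns:
    shares = df[col].value_counts(normalize=True).round(2)
    print(f"{col}:")
    print(shares.to_string(), end="\n\n")

lunch_counts = df["lunch"].value_counts()
```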

Insights

Student performance is related to lunch type, race/ethnicity, and parental level of education.
Female students lead in pass percentage and are also the top scorers.
Student performance is only weakly related to the test preparation course.
Even so, finishing the preparation course is beneficial.

Model Training

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
# Modelling
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor, AdaBoostRegressor
from sklearn.svm import SVR
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.model_selection import RandomizedSearchCV

Preparing X and Y variables

# Features: every column except the target
X = df.drop(columns=['math_score'])
# Target: the math score
y = df['math_score']

Create Column Transformer with 2 types of transformers


num_features = X.select_dtypes(exclude="object").columns
cat_features = X.select_dtypes(include="object").columns

from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.compose import ColumnTransformer

numeric_transformer = StandardScaler()
oh_transformer = OneHotEncoder()

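preprocessor = ColumnTransformer(
    [
        ("OneHotEncoder", oh_transformer, cat_features),
        ("StandardScaler", numeric_transformer, num_features),
    ]
)

# Encode and scale the features so the models below receive numeric input
X = preprocessor.fit_transform(X)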

Separate the dataset into train and test sets

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
X_train.shape, X_test.shape
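
An alternative arrangement, sketched below, wraps the preprocessing and the model in a scikit-learn `Pipeline` so that the encoder and scaler are fitted on the training rows only. This is a suggestion rather than the project's own code; the data is synthetic and the choice of `Ridge` and of `handle_unknown="ignore"` are assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

rng = np.random.default_rng(42)
n = 200
reading = rng.integers(30, 100, size=n)
X = pd.DataFrame({
    "lunch": rng.choice(["standard", "free/reduced"], size=n),
    "reading_score": reading,
})
# Synthetic target loosely tied to reading, for illustration only.
y = 0.9 * reading + rng.normal(0, 5, size=n)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
pipe = Pipeline([
    ("prep", ColumnTransformer([
        ("onehot", OneHotEncoder(handle_unknown="ignore"), ["lunch"]),
        ("scale", StandardScaler(), ["reading_score"]),
    ])),
    ("model", Ridge()),
])
pipe.fit(X_train, y_train)  # encoder and scaler see training rows only
print(X_train.shape, X_test.shape)
print(round(pipe.score(X_test, y_test), 3))  # R2 on the held-out rows
```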

Create an evaluation function that returns all metrics after model training

def evaluate_model(true, predicted):
    """Return MAE, RMSE, and R2 for the given true and predicted values."""
    mae = mean_absolute_error(true, predicted)
    mse = mean_squared_error(true, predicted)
    rmse = np.sqrt(mse)
    r2_square = r2_score(true, predicted)
    return mae, rmse, r2_square
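
As a sanity check, the function can be run on a tiny hand-computable case. The three values below are made up; with errors of -1, 0, and +1 the metrics can be verified by hand.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

def evaluate_model(true, predicted):
    mae = mean_absolute_error(true, predicted)
    rmse = np.sqrt(mean_squared_error(true, predicted))
    r2_square = r2_score(true, predicted)
    return mae, rmse, r2_square

# Errors are -1, 0, +1, so MAE = 2/3 and MSE = 2/3.
# Total sum of squares around the mean (4) is 8, so R2 = 1 - 2/8.
mae, rmse, r2 = evaluate_model([2, 4, 6], [3, 4, 5])
print(f"MAE={mae:.4f} RMSE={rmse:.4f} R2={r2:.4f}")
# MAE=0.6667 RMSE=0.8165 R2=0.7500
```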

models = {
    "Linear Regression" : LinearRegression(),
    "Lasso" : Lasso(),
    "Ridge" : Ridge(),
    "K-Neighbors Regressor" : KNeighborsRegressor(),
    "Decision Tree" : DecisionTreeRegressor(),
    "Random Forest Regressor" : RandomForestRegressor(),
    "AdaBoost Regressor" : AdaBoostRegressor()
}
model_list = []
r2_list = []

for name, model in models.items():
    model.fit(X_train, y_train)

    # Make predictions
    y_train_pred = model.predict(X_train)
    y_test_pred = model.predict(X_test)

    # Evaluate on the train and test sets
    model_train_mae, model_train_rmse, model_train_r2 = evaluate_model(y_train, y_train_pred)
    model_test_mae, model_test_rmse, model_test_r2 = evaluate_model(y_test, y_test_pred)

    print(name)
    model_list.append(name)

    print('Model performance for Training set')
    print('- Root Mean Squared Error: {:.4f}'.format(model_train_rmse))
    print('- Mean Absolute Error: {:.4f}'.format(model_train_mae))
    print('- R2 Score: {:.4f}'.format(model_train_r2))

    print('-' * 35)

    print('Model performance for Test set')
    print('- Root Mean Squared Error: {:.4f}'.format(model_test_rmse))
    print('- Mean Absolute Error: {:.4f}'.format(model_test_mae))
    print('- R2 Score: {:.4f}'.format(model_test_r2))
    r2_list.append(model_test_r2)

    print('=' * 35)
    print('\n')

Model Name	R² Score
Ridge	0.880593
Linear Regression	0.880433
Random Forest Regressor	0.848128
AdaBoost Regressor	0.843365
Lasso	0.825320
K-Neighbors Regressor	0.783446
Decision Tree	0.738964
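
A ranking like the one above can be built from `model_list` and `r2_list`. In the sketch below the two lists are filled with the values copied from the table so that the example runs on its own; in the notebook they would come from the training loop.

```python
import pandas as pd

# Values copied from the table above; in the notebook they would come
# from model_list and r2_list filled in by the training loop.
model_list = ["Linear Regression", "Lasso", "Ridge", "K-Neighbors Regressor",
              "Decision Tree", "Random Forest Regressor", "AdaBoost Regressor"]
r2_list = [0.880433, 0.825320, 0.880593, 0.783446, 0.738964, 0.848128, 0.843365]

results = (
    pd.DataFrame({"Model Name": model_list, "R2 Score": r2_list})
    .sort_values("R2 Score", ascending=False)
    .reset_index(drop=True)
)
print(results)
```

Note that Ridge and Linear Regression differ only in the fourth decimal place, so the ordering of the top two could change with a different random split.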

Plot y_pred and y_test

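# y_pred is not defined earlier in this listing; here it is assumed to be the
# test-set predictions of the fitted Linear Regression model
y_pred = models["Linear Regression"].predict(X_test)
plt.scatter(y_test, y_pred)
plt.xlabel('Actual')
plt.ylabel('Predicted')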

Difference between Actual and Predicted Values

pred_df=pd.DataFrame({'Actual Value':y_test,'Predicted Value':y_pred,'Difference':y_test-y_pred})
pred_df

Index	Actual Value	Predicted Value	Difference
521	91	76.387970	14.612030
737	53	58.885970	-5.885970
740	80	76.990265	3.009735
660	74	76.851804	-2.851804
411	84	87.627378	-3.627378
...	...	...	...
408	52	43.409149	8.590851
332	62	62.152214	-0.152214
208	74	67.888395	6.111605
613	65	67.022287	-2.022287
78	61	62.345132	-1.345132

Model Selection Criteria

Models were evaluated using:

Root Mean Squared Error (RMSE)
Mean Absolute Error (MAE)
R² Score (Coefficient of Determination)

Hyperparameter tuning was done using RandomizedSearchCV to optimize performance.
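
The tuning setup itself is not shown in this README, so the following is only a sketch of how `RandomizedSearchCV` might be used. The choice of `Ridge`, the `alpha` search range, `n_iter=20`, and the synthetic data are all assumptions.

```python
import numpy as np
from scipy.stats import loguniform
from sklearn.linear_model import Ridge
from sklearn.model_selection import RandomizedSearchCV

rng = np.random.default_rng(0)
X = rng.normal(size=(120, 5))
# Synthetic linear target with a little noise, for illustration only.
y = X @ np.array([3.0, -2.0, 0.0, 1.0, 0.5]) + rng.normal(0, 0.5, size=120)

search = RandomizedSearchCV(
    estimator=Ridge(),
    param_distributions={"alpha": loguniform(1e-3, 1e2)},  # sampled on a log scale
    n_iter=20,          # number of random alpha values to try
    scoring="r2",
    cv=5,               # 5-fold cross-validation per candidate
    random_state=42,
)
search.fit(X, y)
print(search.best_params_)
print(round(search.best_score_, 3))  # mean cross-validated R2 of the best alpha
```

Random search samples a fixed number of candidates from the given distribution, which keeps the cost predictable compared with an exhaustive grid.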

Insights from Data

Parental education level has a significant impact on student performance.
Students who completed test preparation courses performed better.
Lunch type (standard vs. free/reduced) also showed a correlation with performance.
Reading and writing scores are closely related, meaning students who do well in one tend to do well in the other.
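
The relationship between the subject scores can be quantified with a correlation matrix. The sketch below uses synthetic scores constructed so that writing tracks reading closely; the numbers it prints describe the synthetic sample and are not results from the dataset.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
reading = rng.normal(69, 14, size=300)
df = pd.DataFrame({
    "reading_score": reading,
    # Synthetic: writing tracks reading closely, math less so.
    "writing_score": reading + rng.normal(0, 4, size=300),
    "math_score": 0.8 * reading + rng.normal(10, 9, size=300),
})
corr = df.corr()  # Pearson correlation between every pair of columns
print(corr.round(2))
```

On the real data, the same `df[score_columns].corr()` call would show how strongly reading and writing move together.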

Installation

To set up the project, follow these steps:

Clone the repository:

git clone https://github.com/your-repo/student-performance-analysis.git
cd student-performance-analysis

Create and activate a virtual environment, then install the dependencies:

python -m venv venv
source venv/bin/activate   # On Windows use: venv\Scripts\activate
pip install -r requirements.txt

Usage

Running the Flask App

To start the Flask application, run:

python app.py

Access the web application at http://127.0.0.1:5000/.

Predicting Student Performance

Enter student details in the form (gender, ethnicity, parental education, lunch, test preparation course, reading & writing scores).
Click the "Predict" button.
The model will predict and display the expected math score.
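
Behind the form, the submitted values have to be arranged into the same columns the model was trained on. The sketch below shows one way this might look; the helper `form_to_frame`, the constant `FEATURE_COLUMNS`, and the field names are hypothetical and are not taken from the project's `app.py`.

```python
import pandas as pd

FEATURE_COLUMNS = [
    "gender", "race_ethnicity", "parental_level_of_education",
    "lunch", "test_preparation_course", "reading_score", "writing_score",
]

def form_to_frame(form: dict) -> pd.DataFrame:
    """Turn submitted form fields into a one-row frame in training column order."""
    missing = [c for c in FEATURE_COLUMNS if c not in form]
    if missing:
        raise ValueError(f"Missing fields: {missing}")
    row = {c: form[c] for c in FEATURE_COLUMNS}
    # Form values arrive as strings, so the two scores are cast to numbers.
    row["reading_score"] = float(row["reading_score"])
    row["writing_score"] = float(row["writing_score"])
    return pd.DataFrame([row], columns=FEATURE_COLUMNS)

example = form_to_frame({
    "gender": "female", "race_ethnicity": "group B",
    "parental_level_of_education": "bachelor's degree",
    "lunch": "standard", "test_preparation_course": "none",
    "reading_score": "72", "writing_score": "74",
})
print(example)
```

The resulting one-row frame could then be passed through the fitted preprocessor and model to obtain the predicted math score.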

Conclusion

The analysis provides insights into factors influencing student performance.
The model can help predict student scores and identify at-risk students.
Future improvements can include deep learning models or more advanced feature engineering.

🔗 Check out the full project on GitHub:

👉 GitHub Repository

Happy Coding!
