Project Overview
- This project analyzes student performance in relation to factors such as gender, ethnicity, parental education, lunch type, and test preparation course. A machine learning model then predicts a student's math score from these factors together with the reading and writing scores.
Problem Statement
- This project examines how students' performance in math, reading, and writing varies with demographic and contextual factors such as gender, ethnicity, parental education, lunch type, and test preparation. The goal is to identify patterns and relationships that influence academic achievement.
Technologies Used
- Programming Language: Python
- Libraries: Pandas, NumPy, Matplotlib, Seaborn, Scikit-learn
- Framework: Flask (for deployment)
Data Collection
- Dataset source - https://www.kaggle.com/datasets/spscientist/students-performance-in-exams/code
- The dataset consists of 8 columns and 1000 rows.
Data Description
- gender: The gender of the student (e.g., male, female).
- race_ethnicity: The group classification of the student (e.g., group A, group B, etc.).
- parental_level_of_education: The highest level of education attained by the student's parent(s) (e.g., bachelor's degree, some college).
- lunch: The type of lunch the student receives (e.g., standard, free/reduced).
- test_preparation_course: Whether the student completed a test preparation course (e.g., none, completed).
- math_score: The student's score in mathematics.
- reading_score: The student's score in reading.
- writing_score: The student's score in writing.
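The column layout above can be checked when the data is loaded. The following is a minimal sketch, not the project's actual loading code: `load_scores` is a hypothetical helper, and the three sample rows are made up so that the example runs without the real file. With the full dataset the shape should be 1000 rows by 8 columns, as stated above.

```python
import io
import pandas as pd

# Three made-up rows in the column layout described above,
# used here in place of the real CSV file.
SAMPLE_CSV = """gender,race_ethnicity,parental_level_of_education,lunch,test_preparation_course,math_score,reading_score,writing_score
female,group B,bachelor's degree,standard,none,72,72,74
male,group C,some college,free/reduced,completed,69,90,88
female,group A,high school,standard,none,47,57,44
"""

EXPECTED_COLUMNS = [
    "gender", "race_ethnicity", "parental_level_of_education", "lunch",
    "test_preparation_course", "math_score", "reading_score", "writing_score",
]

def load_scores(source):
    """Read the exam data (path or file-like object) and verify its columns."""
    df = pd.read_csv(source)
    missing = [c for c in EXPECTED_COLUMNS if c not in df.columns]
    if missing:
        raise ValueError(f"missing columns: {missing}")
    return df

df = load_scores(io.StringIO(SAMPLE_CSV))
print(df.shape)  # (3, 8) for the sample rows
```

Passing a file path instead of the `StringIO` object would load the downloaded CSV in the same way, provided its headers match the names listed above.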
Insight
- Consistent Performance: The average scores for math, reading, and writing are relatively close, suggesting a balanced curriculum or consistent evaluation criteria.
- High Standard Deviation: The standard deviations for all three subjects indicate some variability in performance.
- Outliers: Scores like 0 in math and 10 in writing might require investigation to check whether they are outliers or represent missing/incomplete data.
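As a rough illustration of how such figures might be obtained, the sketch below computes the mean and standard deviation per subject and flags unusual scores with the common 1.5 × IQR rule. The rule is an assumption chosen for the example, not necessarily the check used in this project, and the scores are made up.

```python
import pandas as pd

# Made-up scores for illustration only; not taken from the dataset.
scores = pd.DataFrame({
    "math_score":    [0, 55, 62, 66, 70, 74, 81, 100],
    "reading_score": [17, 58, 64, 69, 72, 77, 85, 100],
    "writing_score": [10, 54, 63, 68, 71, 76, 84, 100],
})

# Mean and standard deviation per subject (one row per subject).
summary = scores.agg(["mean", "std"]).T

def iqr_outliers(series, k=1.5):
    """Return the values lying more than k * IQR outside the quartiles."""
    q1, q3 = series.quantile([0.25, 0.75])
    iqr = q3 - q1
    return series[(series < q1 - k * iqr) | (series > q3 + k * iqr)]

math_outliers = iqr_outliers(scores["math_score"])
print(summary)
print(math_outliers)
```

A flagged value is only a candidate for review; whether a score of 0 is a genuine result or a data-entry gap still has to be decided by looking at the record.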
Exploring Data
Count how many students scored full marks in each subject
# Summing a boolean mask gives the number of matching rows
math_full = (df['math_score'] == 100).sum()
writing_full = (df['writing_score'] == 100).sum()
reading_full = (df['reading_score'] == 100).sum()
print(f'Number of students with full marks in Maths: {math_full}')
print(f'Number of students with full marks in Writing: {writing_full}')
print(f'Number of students with full marks in Reading: {reading_full}')
# Count how many students scored 20 marks or fewer in each subject
math_less_20 = (df['math_score'] <= 20).sum()
writing_less_20 = (df['writing_score'] <= 20).sum()
reading_less_20 = (df['reading_score'] <= 20).sum()
print(f'Number of students with 20 marks or fewer in Maths: {math_less_20}')
print(f'Number of students with 20 marks or fewer in Writing: {writing_less_20}')
print(f'Number of students with 20 marks or fewer in Reading: {reading_less_20}')
Insights
- The counts above show that students performed worst in Maths
- The best performance is in the reading section
Visualize the average score distribution to draw some conclusions
Insights
- The numbers of male and female students are almost equal
- Group C has the largest number of students
- More students have standard lunch than free/reduced lunch
- More students have not enrolled in any test preparation course than have completed one
- "Some College" is the most common parental education level, followed closely by "Associate's Degree"
Insights
- Student performance is related to lunch type, race/ethnicity, and parental level of education
- Females lead in pass percentage and are also the top scorers
- Student performance is not strongly related to the test preparation course
- Finishing the preparation course is beneficial.
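Relationships like these are usually read off group averages. The sketch below shows one way to compare mean scores across a categorical factor; `mean_scores_by` is a hypothetical helper and the four rows are made up, so the numbers say nothing about the real dataset.

```python
import pandas as pd

SCORE_COLUMNS = ["math_score", "reading_score", "writing_score"]

# Made-up rows for illustration only.
demo = pd.DataFrame({
    "lunch": ["standard", "standard", "free/reduced", "free/reduced"],
    "test_preparation_course": ["completed", "none", "completed", "none"],
    "math_score": [80, 70, 60, 50],
    "reading_score": [82, 72, 66, 56],
    "writing_score": [84, 70, 68, 52],
})

def mean_scores_by(df, factor):
    """Average each subject score within every level of a categorical factor."""
    return df.groupby(factor)[SCORE_COLUMNS].mean().round(1)

by_lunch = mean_scores_by(demo, "lunch")
by_prep = mean_scores_by(demo, "test_preparation_course")
print(by_lunch)
print(by_prep)
```

Comparing the size of the gap between groups for each factor gives a first impression of which factors matter more, before any model is trained.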
Model Training
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
# Modelling
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor, AdaBoostRegressor
from sklearn.svm import SVR
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.metrics import r2_score, mean_absolute_error, mean_squared_error
from sklearn.model_selection import RandomizedSearchCV
import warnings
Preparing X and Y variables
X = df.drop(columns=['math_score'])
y = df['math_score']
Create Column Transformer with 2 types of transformers
num_features = X.select_dtypes(exclude="object").columns
cat_features = X.select_dtypes(include="object").columns
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.compose import ColumnTransformer
numeric_transformer = StandardScaler()
oh_transformer = OneHotEncoder()
# One-hot encode the categorical columns and standardize the numeric ones
preprocessor = ColumnTransformer(
    [
        ("OneHotEncoder", oh_transformer, cat_features),
        ("StandardScaler", numeric_transformer, num_features),
    ]
)
# Apply the preprocessing so the models below receive purely numeric input
X = preprocessor.fit_transform(X)
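One way to use a `ColumnTransformer` like this is to wrap it together with a model in a scikit-learn `Pipeline`, so the encoder and scaler are fitted on the training rows only. The sketch below is an illustration under stated assumptions, not the project's training code: the data is synthetic, only four feature columns are used, and `Ridge` is picked arbitrarily as the estimator.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in data; the real dataset is not used here.
rng = np.random.default_rng(0)
n = 200
reading = rng.integers(30, 100, size=n)
demo = pd.DataFrame({
    "lunch": rng.choice(["standard", "free/reduced"], size=n),
    "test_preparation_course": rng.choice(["none", "completed"], size=n),
    "reading_score": reading,
    "writing_score": np.clip(reading + rng.integers(-5, 6, size=n), 0, 100),
})
demo["math_score"] = np.clip(0.9 * reading + rng.normal(0, 5, size=n), 0, 100)

X_demo = demo.drop(columns=["math_score"])
y_demo = demo["math_score"]

cat_cols = ["lunch", "test_preparation_course"]
num_cols = ["reading_score", "writing_score"]

pipe = Pipeline([
    ("preprocess", ColumnTransformer([
        ("onehot", OneHotEncoder(handle_unknown="ignore"), cat_cols),
        ("scale", StandardScaler(), num_cols),
    ])),
    ("model", Ridge()),
])

Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)
pipe.fit(Xd_train, yd_train)          # preprocessing is fitted on training rows only
demo_preds = pipe.predict(Xd_test)    # the same fitted transformers are reused here
demo_r2 = pipe.score(Xd_test, yd_test)
print(round(demo_r2, 3))
```

Keeping preprocessing inside the pipeline also means the fitted object can be saved and reused as a single unit at prediction time.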
Separate dataset into train and test
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
X_train.shape, X_test.shape
Create an Evaluate Function to give all metrics after model Training
def evaluate_model(true, predicted):
    # Return MAE, RMSE, and R² for a set of predictions
    mae = mean_absolute_error(true, predicted)
    mse = mean_squared_error(true, predicted)
    rmse = np.sqrt(mse)
    r2_square = r2_score(true, predicted)
    return mae, rmse, r2_square
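To make the three metrics concrete, here is a small worked example that computes them by hand with NumPy. The three values are made up and the variable names are chosen only for this illustration.

```python
import numpy as np

# Made-up actual and predicted values for illustration only.
demo_true = np.array([3.0, 5.0, 7.0])
demo_pred = np.array([2.0, 5.0, 9.0])

errors = demo_true - demo_pred                          # [1, 0, -2]
mae = np.mean(np.abs(errors))                           # (1 + 0 + 2) / 3 = 1.0
mse = np.mean(errors ** 2)                              # (1 + 0 + 4) / 3 = 5/3
rmse = np.sqrt(mse)                                     # sqrt(5/3), about 1.291
ss_res = np.sum(errors ** 2)                            # residual sum of squares = 5
ss_tot = np.sum((demo_true - demo_true.mean()) ** 2)    # total sum of squares = 8
r2 = 1 - ss_res / ss_tot                                # 1 - 5/8 = 0.375

print(mae, rmse, r2)
```

RMSE is in the same units as the score itself, which makes it easier to read than MSE; R² compares the model against always predicting the mean.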
models = {
    "Linear Regression": LinearRegression(),
    "Lasso": Lasso(),
    "Ridge": Ridge(),
    "K-Neighbors Regressor": KNeighborsRegressor(),
    "Decision Tree": DecisionTreeRegressor(),
    "Random Forest Regressor": RandomForestRegressor(),
    "AdaBoost Regressor": AdaBoostRegressor(),
}
model_list = []
r2_list = []
for name, model in models.items():
    model.fit(X_train, y_train)
    # Make predictions
    y_train_pred = model.predict(X_train)
    y_test_pred = model.predict(X_test)
    # Evaluate Train and Test dataset
    model_train_mae, model_train_rmse, model_train_r2 = evaluate_model(y_train, y_train_pred)
    model_test_mae, model_test_rmse, model_test_r2 = evaluate_model(y_test, y_test_pred)
    print(name)
    model_list.append(name)
    print('Model performance for Training set')
    print('- Root Mean Squared Error: {:.4f}'.format(model_train_rmse))
    print('- Mean Absolute Error: {:.4f}'.format(model_train_mae))
    print('- R2 Score: {:.4f}'.format(model_train_r2))
    print('-'*35)
    print('Model performance for Test set')
    print('- Root Mean Squared Error: {:.4f}'.format(model_test_rmse))
    print('- Mean Absolute Error: {:.4f}'.format(model_test_mae))
    print('- R2 Score: {:.4f}'.format(model_test_r2))
    r2_list.append(model_test_r2)
    print('='*35)
    print('\n')
Model Name | R² Score |
---|---|
Ridge | 0.880593 |
Linear Regression | 0.880433 |
Random Forest Regressor | 0.848128 |
AdaBoost Regressor | 0.843365 |
Lasso | 0.825320 |
K-Neighbors Regressor | 0.783446 |
Decision Tree | 0.738964 |
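A ranking like the one above can be built from two parallel lists of model names and test R² scores. The sketch below is one plausible way to do it; the three scores are copied from the table above, while the list contents and the `results` name are set up only for this example.

```python
import pandas as pd

# Names and test-set R² scores in the order the models were trained;
# the values are copied from the table above.
names = ["Linear Regression", "Lasso", "Ridge"]
r2_scores = [0.880433, 0.825320, 0.880593]

# Pair each model with its score and rank from best to worst.
results = (
    pd.DataFrame({"Model Name": names, "R2 Score": r2_scores})
    .sort_values(by="R2 Score", ascending=False)
    .reset_index(drop=True)
)
print(results)
```

Note how small the gap between Ridge and Linear Regression is; a difference in the fourth decimal place on a single train/test split is unlikely to be meaningful on its own.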
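Plot y_pred and y_test
# y_pred is assumed to hold the test-set predictions of the selected model;
# the step that computes it is not shown in this README
plt.scatter(y_test, y_pred)
plt.xlabel('Actual')
plt.ylabel('Predicted')
Difference between Actual and Predicted Values
pred_df = pd.DataFrame({'Actual Value': y_test, 'Predicted Value': y_pred, 'Difference': y_test - y_pred})
pred_df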
Index | Actual Value | Predicted Value | Difference |
---|---|---|---|
521 | 91 | 76.387970 | 14.612030 |
737 | 53 | 58.885970 | -5.885970 |
740 | 80 | 76.990265 | 3.009735 |
660 | 74 | 76.851804 | -2.851804 |
411 | 84 | 87.627378 | -3.627378 |
... | ... | ... | ... |
408 | 52 | 43.409149 | 8.590851 |
332 | 62 | 62.152214 | -0.152214 |
208 | 74 | 67.888395 | 6.111605 |
613 | 65 | 67.022287 | -2.022287 |
78 | 61 | 62.345132 | -1.345132 |
Model Selection Criteria
- Models were evaluated using Mean Squared Error (MSE, reported as its root, RMSE), Mean Absolute Error (MAE), and the R² score (coefficient of determination).
- Hyperparameter tuning was done using RandomizedSearchCV to optimize performance.
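The search settings used in this project are not listed here, so the following is only a sketch of how `RandomizedSearchCV` is typically applied. It assumes a `Ridge` model, a log-uniform range for `alpha`, 20 sampled candidates, and 5-fold cross-validation, all chosen for illustration, and it runs on synthetic data from `make_regression`.

```python
from scipy.stats import loguniform
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import RandomizedSearchCV, train_test_split

# Synthetic regression data standing in for the preprocessed features.
X_syn, y_syn = make_regression(n_samples=300, n_features=8, noise=10.0, random_state=42)
Xs_train, Xs_test, ys_train, ys_test = train_test_split(
    X_syn, y_syn, test_size=0.2, random_state=42
)

search = RandomizedSearchCV(
    estimator=Ridge(),
    param_distributions={"alpha": loguniform(1e-3, 1e2)},  # assumed search range
    n_iter=20,          # number of sampled candidates (assumption)
    scoring="r2",
    cv=5,
    random_state=42,
)
search.fit(Xs_train, ys_train)

best_alpha = search.best_params_["alpha"]
tuned_r2 = search.score(Xs_test, ys_test)  # R² of the refitted best model on held-out data
print(best_alpha, tuned_r2)
```

Random search samples a fixed number of candidates instead of trying every combination, which keeps the cost predictable when several hyperparameters are tuned at once.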
Insights from Data
- Parental education level has a significant impact on student performance.
- Students who completed test preparation courses performed better.
- Lunch type (standard vs. free/reduced) also showed a correlation with performance.
- Reading and writing scores are closely related, meaning students who do well in one tend to do well in the other.
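The link between reading and writing scores can be quantified with a Pearson correlation matrix. The sketch below uses five made-up rows, so the resulting coefficients illustrate the method only and are not the project's findings.

```python
import pandas as pd

# Made-up scores for illustration only.
demo_scores = pd.DataFrame({
    "math_score":    [50, 60, 70, 80, 90],
    "reading_score": [55, 62, 74, 79, 93],
    "writing_score": [53, 60, 75, 80, 91],
})

# Pairwise Pearson correlation between the three subjects.
corr = demo_scores.corr()
reading_writing = corr.loc["reading_score", "writing_score"]
print(corr.round(3))
```

A coefficient close to 1 means the two scores rise and fall together; passing `corr` to Seaborn's `heatmap` is a common way to show the whole matrix at a glance.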
Installation
To set up the project, follow these steps:
1. Clone the repository:
git clone https://github.com/your-repo/student-performance-analysis.git
cd student-performance-analysis
2. Create and activate a virtual environment, then install the dependencies:
python -m venv venv
source venv/bin/activate # On Windows use: venv\Scripts\activate
pip install -r requirements.txt
Usage
Running the Flask App
To start the Flask application, run:
python app.py
- Access the web application at http://127.0.0.1:5000/.
Predicting Student Performance
1. Enter student details in the form (gender, ethnicity, parental education, lunch, test preparation course, reading & writing scores).
2. Click the "Predict" button.
3. The model will predict and display the expected math score.
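Between the form and the model, the submitted fields have to be turned into a single-row table with the same columns the model was trained on. The sketch below shows one way this step could look; `form_to_frame` is a hypothetical helper and not taken from the project's `app.py`, and the field names are assumed to match the column names in the Data Description.

```python
import pandas as pd

# Feature columns in training order (everything except the math_score target).
FEATURE_COLUMNS = [
    "gender", "race_ethnicity", "parental_level_of_education", "lunch",
    "test_preparation_course", "reading_score", "writing_score",
]

def form_to_frame(form):
    """Convert submitted form fields (a dict of strings) into a one-row DataFrame."""
    missing = [c for c in FEATURE_COLUMNS if c not in form]
    if missing:
        raise ValueError(f"missing form fields: {missing}")
    row = {c: form[c] for c in FEATURE_COLUMNS}
    # Form values arrive as text, so the two scores are converted to numbers.
    for col in ("reading_score", "writing_score"):
        row[col] = float(row[col])
    return pd.DataFrame([row], columns=FEATURE_COLUMNS)

# Example submission with made-up values.
example_form = {
    "gender": "female",
    "race_ethnicity": "group B",
    "parental_level_of_education": "bachelor's degree",
    "lunch": "standard",
    "test_preparation_course": "none",
    "reading_score": "72",
    "writing_score": "74",
}
input_frame = form_to_frame(example_form)
print(input_frame)
```

The resulting frame could then be passed through the fitted preprocessor and model to obtain the predicted math score.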
Conclusion
The analysis provides insights into factors influencing student performance.
The model can help predict student scores and identify at-risk students.
Future improvements can include deep learning models or more advanced feature engineering.
Happy Coding!