Student Performance Analysis

Sowmiya Siva - Feb 24 - Dev Community

Project Overview

  • This project aims to analyze student performance based on various factors such as gender, ethnicity, parental education, lunch type, and test preparation course. The model predicts student scores in reading, writing, and math using machine learning techniques.

Problem Statement

  • This project examines how students perform in math, reading, and writing based on demographic and contextual factors such as gender, ethnicity, parental education, lunch type, and test preparation. The goal is to identify patterns and relationships influencing academic achievement.

Technologies Used

  • Programming Language: Python
  • Libraries: Pandas, NumPy, Matplotlib, Seaborn, Scikit-learn
  • Framework: Flask (for deployment)

Data Collection

Data Description

  • gender: The gender of the student (e.g., male, female).
  • race_ethnicity: The group classification of the student (e.g., group A, group B, etc.).
  • parental_level_of_education: The highest level of education attained by the student's parent(s) (e.g., bachelor's degree, some college).
  • lunch: The type of lunch the student receives (e.g., standard, free/reduced).
  • test_preparation_course: Whether the student completed a test preparation course (e.g., none, completed).
  • math_score: The student's score in mathematics.
  • reading_score: The student's score in reading.
  • writing_score: The student's score in writing.

(Figures: preview of the dataset and its summary statistics)
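
The exploration code later in the post refers to an average_score column that is not part of the raw columns listed above. Below is a minimal loading sketch, assuming the data ships as a CSV and that average_score is simply the mean of the three subject scores (both are assumptions, not shown in the post):

import pandas as pd

# Load the dataset (the file name here is an assumption)
df = pd.read_csv('StudentsPerformance.csv')

# average_score is referenced in the exploration below; assume it is the
# mean of the three subject scores
df['average_score'] = (df['math_score'] + df['reading_score'] + df['writing_score']) / 3

df.head()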

Insight

  • Consistent Performance: The average scores for math, reading, and writing are relatively close, suggesting a balanced curriculum or consistent evaluation criteria.

  • High Standard Deviation: The standard deviations for all three subjects indicate considerable spread in individual performance.

  • Outliers: Scores like 0 in math and 10 in writing might require investigation to check whether they are genuine outliers or represent missing/incomplete data.
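
These observations can be reproduced from the summary statistics, for example (assuming df is the DataFrame loaded above):

# Summary statistics (mean, std, min, max) for the three subjects
df[['math_score', 'reading_score', 'writing_score']].describe()

# Inspect the lowest scores to judge whether they are genuine outliers
df.nsmallest(5, 'math_score')[['math_score', 'reading_score', 'writing_score']]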

Exploring Data

Analysing how many students scored full marks in each subject

math_full = df[df['math_score'] == 100]['average_score'].count()
writing_full = df[df['writing_score'] == 100]['average_score'].count()
reading_full = df[df['reading_score'] == 100]['average_score'].count()


print(f'Number of students with full marks in Maths: {math_full}')
print(f'Number of students with full marks in Writing: {writing_full}')
print(f'Number of students with full marks in Reading: {reading_full}')

Analysing how many students scored 20 marks or fewer in each subject

math_less_20 = df[df['math_score'] <= 20]['average_score'].count()
writing_less_20 = df[df['writing_score'] <= 20]['average_score'].count()
reading_less_20 = df[df['reading_score'] <= 20]['average_score'].count()

print(f'Number of students with 20 or fewer marks in Maths: {math_less_20}')
print(f'Number of students with 20 or fewer marks in Writing: {writing_less_20}')
print(f'Number of students with 20 or fewer marks in Reading: {reading_less_20}')

Insights

  • From the above values we can see that students performed worst in Maths
  • The best performance is in the Reading section

Visualize the distribution of students across the categorical features to draw some conclusions

(Figure: pie charts of the distributions of gender, race/ethnicity, lunch, test preparation course, and parental education)
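
A sketch of how such pie charts can be generated with Matplotlib; the exact layout and figure size are choices of this sketch, not taken from the post:

import matplotlib.pyplot as plt

cat_cols = ['gender', 'race_ethnicity', 'lunch',
            'test_preparation_course', 'parental_level_of_education']

fig, axes = plt.subplots(1, len(cat_cols), figsize=(24, 5))
for ax, col in zip(axes, cat_cols):
    counts = df[col].value_counts()
    ax.pie(counts, labels=counts.index, autopct='%1.1f%%')
    ax.set_title(col)
plt.tight_layout()
plt.show()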

Insights

  • The number of male and female students is almost equal
  • Group C has the most students
  • More students have a standard lunch than a free/reduced one
  • More students have not enrolled in a test preparation course than have completed one
  • "Some College" is the most common parental education level, followed closely by "Associate's Degree"

(Figure: pair plot of the math, reading, and writing scores)
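
The pair plot is typically produced with Seaborn; a minimal version, assuming the points were coloured by gender:

import seaborn as sns
import matplotlib.pyplot as plt

sns.pairplot(df, vars=['math_score', 'reading_score', 'writing_score'], hue='gender')
plt.show()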

Insights

  • Students' performance is related to lunch type, race/ethnicity, and parental level of education
  • Females lead in pass percentage and are also among the top scorers (a quick check is sketched below)
  • Students' performance is not strongly related to the test preparation course
  • Even so, finishing the preparation course is beneficial.
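
As a rough check of the gender-related points above, pass percentage and top scorers can be computed directly; the pass mark of 40 used here is an assumption:

PASS_MARK = 40  # assumed threshold

passed = (
    (df['math_score'] >= PASS_MARK)
    & (df['reading_score'] >= PASS_MARK)
    & (df['writing_score'] >= PASS_MARK)
)
# Pass percentage by gender
df.assign(passed=passed).groupby('gender')['passed'].mean() * 100

# Top five students by average score
df.nlargest(5, 'average_score')[['gender', 'average_score']]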

Model Training

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
# Modelling
from sklearn.metrics import r2_score, mean_absolute_error, mean_squared_error
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor, AdaBoostRegressor
from sklearn.svm import SVR
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.model_selection import RandomizedSearchCV
import warnings

Preparing the X and y variables

# Drop the target; average_score is also dropped because it is derived from math_score
X = df.drop(columns=['math_score', 'average_score'])
y = df['math_score']

Create a ColumnTransformer with separate transformers for the categorical and numeric features


num_features = X.select_dtypes(exclude="object").columns
cat_features = X.select_dtypes(include="object").columns

from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.compose import ColumnTransformer

numeric_transformer = StandardScaler()
oh_transformer = OneHotEncoder()

preprocessor = ColumnTransformer(
    [
        ("OneHotEncoder", oh_transformer, cat_features),
        ("StandardScaler", numeric_transformer, num_features),
    ]
)
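
The post does not show the preprocessor being applied, but the models below cannot be fitted on raw string columns, so presumably X is transformed before the split, along these lines:

# One-hot encode the categorical columns and scale the numeric ones
X = preprocessor.fit_transform(X)
X.shape

Fitting on the full dataset before splitting is a simplification; wrapping the preprocessor and the model in a single scikit-learn Pipeline fitted only on the training split would avoid leaking scaling statistics into the test set.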

Separate the dataset into train and test sets

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.2,random_state=42)
X_train.shape, X_test.shape

Create an evaluation function to report all metrics after model training

def evaluate_model(true, predicted):
    mae = mean_absolute_error(true, predicted)
    mse = mean_squared_error(true, predicted)
    rmse = np.sqrt(mse)
    r2_square = r2_score(true, predicted)
    return mae, rmse, r2_square
models = {
    "Linear Regression" : LinearRegression(),
    "Lasso" : Lasso(),
    "Ridge" : Ridge(),
    "K-Neighbors Regressor" : KNeighborsRegressor(),
    "Decision Tree" : DecisionTreeRegressor(),
    "Random Forest Regressor" : RandomForestRegressor(),
    "AdaBoost Regressor" : AdaBoostRegressor()
}
model_list=[]
r2_list=[]

for i in range (len(list(models))):
    model=list(models.values())[i]
    model.fit(X_train,y_train)

    # Make predictions
    y_train_pred = model.predict(X_train)
    y_test_pred = model.predict(X_test)

    # Evaluate Train and Test dataset
    model_train_mae , model_train_rmse, model_train_r2 = evaluate_model(y_train, y_train_pred)

    model_test_mae , model_test_rmse, model_test_r2 = evaluate_model(y_test, y_test_pred)

    print(list(models.keys())[i])
    model_list.append(list(models.keys())[i])


    print('Model performance for Training set')
    print('- Root Mean Squared Error: {:.4f}'.format(model_train_rmse))
    print('- Mean Absolute Error: {:.4f}'.format(model_train_mae))
    print('- R2 Score: {:.4f}'.format(model_train_r2))

    print('-'*35)

    print('Model performance for Test set')
    print('- Root Mean Squared Error: {:.4f}'.format(model_test_rmse))
    print('- Mean Absolute Error: {:.4f}'.format(model_test_mae))
    print('- R2 Score: {:.4f}'.format(model_test_r2))
    r2_list.append(model_test_r2)


    print('='*35)
    print('\n')
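
The test-set R² values collected in model_list and r2_list can then be arranged into the ranking shown below, for example:

results = pd.DataFrame({'Model Name': model_list, 'R2 Score': r2_list})
results.sort_values(by='R2 Score', ascending=False)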

Model Name                 R² Score
Ridge                      0.880593
Linear Regression          0.880433
Random Forest Regressor    0.848128
AdaBoost Regressor         0.843365
Lasso                      0.825320
K-Neighbors Regressor      0.783446
Decision Tree              0.738964

Plot y_pred and y_test

# y_pred: test-set predictions from the chosen model, e.g. a fitted LinearRegression
y_pred = LinearRegression().fit(X_train, y_train).predict(X_test)

plt.scatter(y_test, y_pred)
plt.xlabel('Actual')
plt.ylabel('Predicted')

(Figure: scatter plot of actual vs. predicted math scores)

Difference between Actual and Predicted Values

pred_df = pd.DataFrame({'Actual Value': y_test, 'Predicted Value': y_pred, 'Difference': y_test - y_pred})
pred_df
Index  Actual Value  Predicted Value  Difference
521    91            76.387970        14.612030
737    53            58.885970        -5.885970
740    80            76.990265        3.009735
660    74            76.851804        -2.851804
411    84            87.627378        -3.627378
...    ...           ...              ...
408    52            43.409149        8.590851
332    62            62.152214        -0.152214
208    74            67.888395        6.111605
613    65            67.022287        -2.022287
78     61            62.345132        -1.345132

Model Selection Criteria

  • Models were evaluated using:
  1. Mean Squared Error (MSE)

  2. Mean Absolute Error (MAE)

  3. R² Score (Coefficient of Determination)

  • Hyperparameter tuning was done using RandomizedSearchCV to optimize performance; a sketch of this step is shown below.
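
A minimal sketch of that tuning step for the Random Forest; the parameter ranges here are illustrative, not the ones used in the project:

from sklearn.model_selection import RandomizedSearchCV
from sklearn.ensemble import RandomForestRegressor

# Illustrative search space (assumed, not from the post)
rf_params = {
    'n_estimators': [100, 200, 500],
    'max_depth': [None, 5, 10, 15],
    'min_samples_split': [2, 5, 10],
}

search = RandomizedSearchCV(
    RandomForestRegressor(random_state=42),
    param_distributions=rf_params,
    n_iter=10,
    cv=3,
    scoring='r2',
    n_jobs=-1,
)
search.fit(X_train, y_train)
search.best_params_, search.best_score_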

Insights from Data

  • Parental education level has a significant impact on student performance.

  • Students who completed test preparation courses performed better.

  • Lunch type (standard vs. free/reduced) also showed a correlation with performance.

  • Reading and writing scores are closely related, meaning students who do well in one tend to do well in the other.
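
The last point can be verified with a quick correlation check:

# Pearson correlation between reading and writing scores
df['reading_score'].corr(df['writing_score'])

# Correlation matrix for all three subjects
df[['math_score', 'reading_score', 'writing_score']].corr()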

Installation

To set up the project, follow these steps:

  1. Clone the repository:
git clone https://github.com/your-repo/student-performance-analysis.git
cd student-performance-analysis

  2. Create a virtual environment and install the dependencies:

python -m venv venv
source venv/bin/activate   # On Windows use: venv\Scripts\activate
pip install -r requirements.txt

Usage

Running the Flask App

To start the Flask application, run:

python app.py

Predicting Student Performance

  • Enter student details in the form (gender, ethnicity, parental education, lunch, test preparation course, reading & writing scores).

  • Click the "Predict" button.

  • The model will predict and display the expected math score.
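
Under the hood, the Flask app presumably exposes a route that collects these form fields, runs them through the saved preprocessor and model, and returns the predicted math score. Here is a minimal sketch, assuming the fitted objects were pickled as preprocessor.pkl and model.pkl and that the result is rendered in a home.html template (all of these names are assumptions):

import pickle

import pandas as pd
from flask import Flask, render_template, request

app = Flask(__name__)

# Load the fitted preprocessor and regression model (file names are assumptions)
preprocessor = pickle.load(open('preprocessor.pkl', 'rb'))
model = pickle.load(open('model.pkl', 'rb'))

@app.route('/predict', methods=['POST'])
def predict():
    # Collect the form fields into a one-row DataFrame
    data = pd.DataFrame([{
        'gender': request.form['gender'],
        'race_ethnicity': request.form['race_ethnicity'],
        'parental_level_of_education': request.form['parental_level_of_education'],
        'lunch': request.form['lunch'],
        'test_preparation_course': request.form['test_preparation_course'],
        'reading_score': float(request.form['reading_score']),
        'writing_score': float(request.form['writing_score']),
    }])
    # Transform the input and predict the math score
    prediction = model.predict(preprocessor.transform(data))[0]
    return render_template('home.html', results=round(prediction, 2))

if __name__ == '__main__':
    app.run(debug=True)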

Conclusion

  • The analysis provides insights into factors influencing student performance.

  • The model can help predict student scores and identify at-risk students.

  • Future improvements can include deep learning models or more advanced feature engineering.

🔗 Check out the full project on GitHub:

👉 GitHub Repository

Happy Coding!
