Detect, Defend, Prevail: Payments Fraud Detection using ML & Deepchecks

Jagroop Singh - Jan 13 - Dev Community

If you are new to machine learning or have just started, you have come to the perfect place!!

Today, we are going to build a full-fledged Machine Learning project. Along the way, you will have the opportunity to work with the following technologies and tools:

  • Where we build our model: Google Colab
  • Data Preprocessing: NumPy, Pandas
  • ML Model Creation: scikit-learn
  • Validation and Testing the ML Model: the Deepchecks platform

So now we have the items in our toolbox, but we need a problem statement to show how we will use them to create something wonderful.
So, let's put our technologies to work on developing Online Payment Fraud Detection.

Now we know what we're going to use and what our ultimate outcome will be. So, let's begin:

Step 1: Import the libraries we will use in this project:

import pandas as pd
import numpy as np


Step 2: Load Data

In this project we are using a genuine dataset from Kaggle. The dataset is available for download at this link:
Online Payments Fraud Detection

Download the dataset, rename it (if you want), and upload it to Google Colab, and it's ready for use:

df = pd.read_csv('payment_fraud_detection.csv')
df.head()

df.head() shows us the first 5 rows of the CSV file, as shown:

Data

Step 3: Get familiar with features

Let's explore the features:

  • step: represents a unit of time, where 1 step equals 1 hour
  • type: type of online transaction
  • amount: the amount of the transaction
  • nameOrig: customer starting the transaction
  • oldbalanceOrg: balance of the sender before the transaction
  • newbalanceOrig: balance of the sender after the transaction
  • nameDest: recipient of the transaction
  • oldbalanceDest: balance of the recipient before the transaction
  • newbalanceDest: balance of the recipient after the transaction
  • isFraud: whether the transaction is fraudulent
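
Before cleaning anything, it also helps to peek at the data's structure. A minimal sketch using pandas (assuming the df loaded above):

# Column dtypes and non-null counts - a quick way to spot missing data
df.info()

# Summary statistics for the numerical columns
df.describe()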

Step 4: Data Cleaning
There are numerous steps involved in the data cleansing process, but we will focus on the most crucial ones here, which are as follows:

  • Eliminating null data
  • Eliminating columns that don't affect whether a payment is fraudulent

Let's check whether there is any null data using:

df.isnull().sum()

We can clearly see that there is null data here:

Null Data

We can handle this in 2 ways:

  • Dropping those rows from our dataset (not preferable if the percentage of null data is more than 1%)
  • Replacing nulls with the mean or median for numerical data, and the mode for categorical data (a sketch of this option follows the list)
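
For illustration, here is a minimal sketch of the second option (the column names are examples from this dataset; only impute columns where an average actually makes sense):

# Illustrative only: impute instead of dropping.
# Numerical column -> fill with the median (more robust to outliers than the mean)
df['newbalanceDest'] = df['newbalanceDest'].fillna(df['newbalanceDest'].median())

# Categorical column -> fill with the mode (most frequent value)
df['type'] = df['type'].fillna(df['type'].mode()[0])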

In our data, isFraud is the output variable used to check whether a payment is fraudulent, so we can't fill it with any average. Also, the amount of null data is less than 1% of the dataset, so we can drop those rows using:

df = df.dropna(subset=['newbalanceDest', 'isFraud', 'isFlaggedFraud'])

Now let's check which columns don't affect our results if we remove them.
On careful consideration, we find that the fields 'isFlaggedFraud', 'nameOrig', and 'nameDest' don't help determine whether a payment is fraudulent. So let's remove them:

df.drop(['isFlaggedFraud','nameOrig','nameDest'], axis = 1, inplace = True)

After running that code, let's check the head of our dataframe (df) again using df.head():

Updated Data

Step 5: Convert Categorical Data into Numerical Data

After careful observation, we have found that type is the only non-numerical column. So let's check its value counts:

df["type"].value_counts()

We can clearly see that it has 5 unique values:

Unique values

Let's encode this into numerical data:

data = pd.get_dummies(df, columns=['type'], drop_first=True)
data.head()

The modified data now appears as follows:

Data Transformation

Step 6: Split Features and Target Values
Features: the input variables.
Target: the output variable.

As is standard in supervised machine learning, input and output variables must be supplied to the model so that it can use what it learns from the dataset to predict future values.

From the dataset, we can clearly identify that isFraud is the target variable and the remaining columns are features.

So let's split them:

# Features: every column except the target (note: columns.difference sorts them alphabetically)
X = data.loc[:, data.columns.difference(['isFraud'])].values
# Target: the isFraud column
y = data.loc[:, "isFraud"].values

Step 7: Split Data into Training and Test Sets

We divide data into training and test sets primarily so that we can use the training data to train our machine learning model and the test data to confirm whether or not the model has been trained correctly.

This can be easily done using scikit-learn :

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)


Here, test_size represents the fraction of data we want as test data; I want 30% of the entire dataset as test data, and train_test_split will select that 30% randomly.
random_state makes the split reproducible: the data is still split randomly, but every time we run the project or share it with someone, the same split is produced.
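
One optional tweak, not used in this project: fraud data is usually highly imbalanced, so stratifying on the target keeps the fraud/non-fraud ratio the same in both sets. A minimal sketch:

# Optional: stratify on y so both sets keep the same fraud ratio
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)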

Step 8: Training the ML Model

This is the main part of our project, where we actually build our ML model.

From the target variable we can see that the result will be either fraud or not fraud, so we apply a classification algorithm.
There are a couple of classification algorithms available, but here we will use one model: RandomForestClassifier.
NOTE: As an assignment, you can try out different algorithms and pick the best one, as sketched below.
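
If you want to attempt that assignment, a rough sketch could look like this (the candidate models are my own picks, not part of the original project):

from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

# Train a few classifiers and compare test accuracy.
# Note: accuracy alone can be misleading on imbalanced fraud data.
candidates = {
    "LogisticRegression": LogisticRegression(max_iter=1000),
    "DecisionTree": DecisionTreeClassifier(random_state=42),
    "RandomForest": RandomForestClassifier(random_state=42),
}

for name, model in candidates.items():
    model.fit(X_train, y_train)
    print(name, model.score(X_test, y_test))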

Let's apply the algorithm using scikit-learn :

from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier()
rf.fit(X_train,y_train)

Our model is ready for evaluation :

ML Model

Step 9: Model Evaluation
This is a vital step, where we determine whether our model is ready for use or still needs some adjustments.

In the past, evaluating a model required writing a tonne of additional code. The Deepchecks platform, however, offers a modern solution.
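
For comparison, here is a rough sketch of what part of that manual evaluation looks like with plain scikit-learn metrics:

from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

y_pred = rf.predict(X_test)

print(accuracy_score(y_test, y_pred))         # overall accuracy
print(confusion_matrix(y_test, y_pred))       # TN/FP/FN/TP breakdown
print(classification_report(y_test, y_pred))  # precision, recall, F1 per class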

With Deepchecks, you can thoroughly validate your data and models from research to production, making it an all-inclusive open-source solution for all your AI & ML validation needs.

Deepchecks

The Deepchecks features that businesses enjoy the most are listed below:

Evaluation of Data Quality:

  • Finding data that is inconsistent or missing.
  • Finding abnormalities and outliers in the dataset.

Validation of the Model:

  • Examining the fairness and bias of the model.
  • Analysing the model's performance using several metrics.
  • Ensuring the model's stability and ability to adapt well to fresh data.

Interpretability and Explainability:

  • Supplying clarification on model predictions to improve comprehension.
  • Displaying the contribution of features to predictions and their relevance.

Integration and Automation:

  • Streamlining the model deployment process by automating the validation procedure.
  • Integration with widely used deep learning technologies and frameworks.

Let's integrate it into our project:

The integration procedure only consists of two steps:

  • Install the library.
  • Copy the code directly from the documentation, adjust the settings, and you're good to go.

Deepchecks Integration

There are lots of solutions provided here; today we are using it to evaluate our model. Since we have structured data (CSV format), we use their Tabular section.

Algo Selection

Let's install :

pip install deepchecks --upgrade

Installation

After successful installation, let's use it to validate our model. You can jump directly into the documentation and try integrating it into the project yourself, or follow along with me.

  • Let's create a Deepchecks Dataset object:
from deepchecks.tabular import Dataset

# All features are numeric after one-hot encoding, so cat_features is empty
train_ds = Dataset(X_train, label=y_train, cat_features=[])
test_ds = Dataset(X_test, label=y_test, cat_features=[])
  • Let's Evaluate our model:
from deepchecks.tabular.suites import model_evaluation

evaluation_suite = model_evaluation()
suite_result = evaluation_suite.run(train_ds, test_ds, rf)
suite_result.show()

It shows us the results:

Results
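
If the interactive report doesn't render in your environment, Deepchecks can also export it to a standalone HTML file (the filename here is arbitrary):

# Export the full report to an HTML file you can open in a browser
suite_result.save_as_html('model_evaluation_report.html')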

Let's explore our model's evaluation:
In the Didn't Pass section, the report explains that some validations related to the train-test split didn't pass, and that this may affect our model.

Didn't Pass Section

No worries; we can easily resolve this using the Deepchecks Train-Test Validation suite.

Integrating it and confirming that the Didn't Pass checks clear is left as an exercise for the reader; I am confident they will.
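
If you want to attempt it, here is a minimal sketch that follows the same pattern as the evaluation suite (it reuses the train_ds and test_ds objects created above):

from deepchecks.tabular.suites import train_test_validation

# Run the split-related checks that flagged issues in the Didn't Pass section
validation_suite = train_test_validation()
validation_result = validation_suite.run(train_ds, test_ds)
validation_result.show()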

Let's explore the Passed section:
The Passed section gives us a lot of the information we would otherwise produce manually by writing code, but this platform provides it in literally 5-6 lines of code.

  • Test case report :

Test Case Report

  • Our ROC Curve plot: it also provides an explanation, which is why I love this platform.

Roc Curve

  • Prediction Drift Graph :

Prediction Drift Graph

  • Simple Model Comparison:

Simple Model Comparison

  • Most importantly, the Confusion Matrix:

Confusion matrix

There are many more items in that report that I haven't included here; if you are following along with the code, I strongly advise you to review them all during your evaluation.

Also, if you have any confusion about it, you can go directly to the discussion section on their GitHub:

Github Discussion

With over 3.2K stars on the repository, Deepchecks offers excellent assistance.

That's all in this blog. One can also fine-tune this model (a small sketch follows) or even use different algorithms to create personalized ML/AI models. 🤖🧠✨
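
As a parting sketch, fine-tuning could start with a small grid search (the parameter grid below is purely illustrative, and searching the full dataset can be slow):

from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier

param_grid = {
    "n_estimators": [100, 200],
    "max_depth": [None, 10, 20],
}

# 3-fold cross-validated search, scored with F1 to respect class imbalance
search = GridSearchCV(RandomForestClassifier(random_state=42),
                      param_grid, cv=3, scoring="f1")
search.fit(X_train, y_train)
print(search.best_params_, search.best_score_)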
