Category Magic: Transforming Categorical Data in ML

Jagroop Singh - Nov 4 '23 - Dev Community

If you are learning machine learning, you will meet this situation during the data preprocessing step: when the data contains categorical features that matter for predicting the output, or when the dependent variable itself is categorical, we must convert that data into numerical form. This process is known as encoding.

Why do we transform Categorical Data into Numerical Form?

We all know that machines only understand 0's and 1's, and machine learning algorithms are no exception. They work on numerical data. So, before we feed data to an algorithm, we must encode the categorical data into numerical form.

How do we Encode Categorical Data?

To transform categorical data into numerical data, we use the concept of dummy variables.

Dummy Variable

A dummy variable is a binary variable that takes the value 0 or 1. Each category in categorical data such as [India, Japan, South Korea] gets its own dummy variable, which is 1 for rows that belong to that category and 0 otherwise.
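As a rough illustration of the idea, here is a minimal sketch using pandas' `get_dummies`. The sample column is hypothetical and only reuses the country names mentioned above; it is not taken from the dataset used later.

```python
import pandas as pd

# Hypothetical sample column; the values only mirror the countries named above.
countries = pd.Series(["India", "Japan", "South Korea", "Japan"], name="Country")

# Build one 0/1 dummy column per distinct category.
dummies = pd.get_dummies(countries, dtype=int)
print(dummies)
```

Each row has exactly one 1, in the column of the country it belongs to.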

If you wish to learn more about dummy variables, see the following article:
Concept of Dummy Variable and Dummy Variable Trap

Implementing the conversion of Categorical Data into Numerical Data using scikit-learn:

Let's use dummy data to demonstrate encoding in scikit-learn:

Data for Machine Learning

Step 1: Importing the libraries

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

Step 2: Importing the dataset
You can download the dataset from here.

dataset = pd.read_csv('Data.csv')
X = dataset.iloc[:,:-1].values
y = dataset.iloc[:,-1].values

Here, I separated the independent variables (X) from the dependent variable (y).
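To see what the two `iloc` slices select, here is a small sketch on a hypothetical stand-in for Data.csv. The column names and values below are assumptions made for illustration; the real file may differ.

```python
import pandas as pd

# Hypothetical stand-in for Data.csv; column names and values are assumptions.
dataset = pd.DataFrame({
    "Country": ["India", "Japan", "South Korea"],
    "Age": [44, 27, 30],
    "Salary": [72000, 48000, 54000],
    "Purchased": ["No", "Yes", "No"],
})

X = dataset.iloc[:, :-1].values  # every column except the last one
y = dataset.iloc[:, -1].values   # only the last column

print(X.shape)  # (3, 3)
print(y.shape)  # (3,)
```

Because X mixes strings and numbers, `.values` returns an object array, which is why the Country column still needs encoding before training.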

Step 3: Taking care of Missing Data

from sklearn.impute import SimpleImputer
imputer = SimpleImputer(missing_values=np.nan,strategy='mean')
imputer.fit(X[:,1:3])
X[:,1:3] = imputer.transform(X[:,1:3])
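To make the effect of the mean strategy visible, here is a minimal sketch on a tiny hypothetical array. The numbers are made up for illustration and are not the values from Data.csv.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical Age/Salary values with one missing entry in each column.
values = np.array([
    [40.0, np.nan],
    [35.0, 58000.0],
    [np.nan, 52000.0],
])

imputer = SimpleImputer(missing_values=np.nan, strategy="mean")
filled = imputer.fit_transform(values)

print(imputer.statistics_)  # per-column means learned during fit
print(filled)               # each NaN replaced by its column mean
```

Each missing entry is replaced by the mean of the non-missing values in the same column.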

Step 4: Encoding Categorical Data

We have reached this step at last.
Upon examining our data, we see that both our dependent variable (y) and one of our independent variables (the Country column of X) contain categorical data.

One thing to notice here: we don't want to expand our output variable (y) into several dummy-variable columns. Instead, we assign a unique numerical label to each category.

For example:

Label Encoding

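The conversion above is known as label encoding: each category is assigned a unique integer label. Note that scikit-learn's LabelEncoder numbers the categories in sorted order, so an ordinal relationship is preserved only when it happens to match that order.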

So let's convert the dependent variable (y) into numerical values using label encoding.

from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
y = np.array(le.fit_transform(y))

Its output is:

[0 1 0 0 1 1 0 1 0 1]
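To check which integer stands for which category, the fitted encoder can be inspected. The sketch below uses hypothetical "No"/"Yes" labels, since the actual target values in Data.csv are not shown here.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Hypothetical target values; the real labels in Data.csv may differ.
y_raw = np.array(["No", "Yes", "No", "Yes"])

le = LabelEncoder()
y_encoded = le.fit_transform(y_raw)

print(le.classes_)                      # index in this array is the integer label
print(y_encoded)                        # encoded labels
print(le.inverse_transform(y_encoded))  # back to the original strings
```

`classes_` is sorted, so the integer labels follow alphabetical order of the category names rather than any meaning-based order.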

Now let's convert our independent variable (X), specifically only the Country column, into numerical data. For this we use one-hot encoding.

One-Hot Encoding converts each categorical value into a binary vector, creating new binary columns for each category.
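To show what these binary vectors look like, here is a from-scratch sketch in NumPy. It is only meant to illustrate the idea on hypothetical values; it is not how scikit-learn implements its encoder.

```python
import numpy as np

# Hypothetical category values, reusing the country names from earlier.
countries = np.array(["India", "Japan", "South Korea", "India"])

# categories: sorted unique values; codes: index of each row's category.
categories, codes = np.unique(countries, return_inverse=True)

# Row i of the identity matrix is the one-hot vector for category i.
one_hot = np.eye(len(categories))[codes]

print(categories)
print(one_hot)
```

Every row contains a single 1, placed in the column that corresponds to that row's category.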

In scikit-learn, this can be done as:

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
ct = ColumnTransformer(transformers=[('encoder',OneHotEncoder(),[0])],remainder='passthrough')
X = np.array(ct.fit_transform(X))

Its output is:

[[1.0 0.0 0.0 44.0 72000.0]
 [0.0 0.0 1.0 27.0 48000.0]
 [0.0 1.0 0.0 30.0 54000.0]
 [0.0 0.0 1.0 38.0 61000.0]
 [0.0 1.0 0.0 40.0 63777.77777777778]
 [1.0 0.0 0.0 35.0 58000.0]
 [0.0 0.0 1.0 38.77777777777778 52000.0]
 [1.0 0.0 0.0 48.0 79000.0]
 [0.0 1.0 0.0 50.0 83000.0]
 [1.0 0.0 0.0 37.0 67000.0]]

🎉 Tadaa! 🎉 Finally, we've mastered categorical encoding! 📊💻

Here's a quick summary of what we've learned:

One-Hot Encoding 🔥🆕:

  • We create binary columns for each category.
  • Assign a '1' to the category that applies and '0' to others.

Label Encoding 🏷️:

  • We replace categories with numerical values.

After that, we can continue with further steps like feature scaling, training the model, testing the model, etc.

👉 You can access the full code from this GitHub repository: Link to Repository

Feel free to explore the code and learn more about categorical encoding! 😊
