📌 Customer Segmentation Using RFM Analysis in Online Retail

Sowmiya Siva - Feb 21 - Dev Community

🚀 A Data Science Approach to Identifying Key Customer Groups


Introduction

Customer segmentation is essential for businesses to understand and engage with their customers effectively. In this project, we apply Recency, Frequency, and Monetary (RFM) analysis to an Online Retail dataset to segment customers based on their purchasing behavior.

We’ll walk through data cleaning, feature engineering, clustering using K-Means, and customer insights to drive better business decisions.

Dataset Used: Online Retail II Dataset


📊 Data Exploration & Cleaning

Understanding the Dataset

The dataset contains transactions from a UK-based online retailer from 2009 to 2011. Before diving into analysis, we explore and clean the data.

Key Issues Identified:

  • ✅ Missing values in Customer ID
  • ✅ Negative values in Quantity & Price
  • ✅ Invalid entries in Invoice & StockCode
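These issues can be counted before any rows are dropped. The following is a minimal sketch, assuming the column names that appear elsewhere in this post (Invoice, StockCode, Quantity, Price, Customer ID). The helper name summarize_issues and the toy rows are hypothetical and only illustrate the checks.

```python
import pandas as pd

def summarize_issues(df: pd.DataFrame) -> dict:
    """Count rows affected by each data-quality issue (illustrative helper)."""
    # Same patterns as the cleaning step: six-digit invoices, five-digit stock codes
    invoice_ok = df["Invoice"].astype(str).str.match(r"^\d{6}$")
    stock_ok = df["StockCode"].astype(str).str.match(r"^\d{5}$")
    return {
        "missing_customer_id": int(df["Customer ID"].isna().sum()),
        "negative_quantity": int((df["Quantity"] < 0).sum()),
        "negative_price": int((df["Price"] < 0).sum()),
        "invalid_invoice": int((~invoice_ok).sum()),
        "invalid_stock_code": int((~stock_ok).sum()),
    }

# Toy rows standing in for the real transactions (made-up values)
toy_df = pd.DataFrame({
    "Invoice": ["489434", "C489449", "489435", "489436"],
    "StockCode": ["85048", "22087", "POST", "21232"],
    "Quantity": [12, -6, 1, 24],
    "Price": [6.95, 2.95, 18.0, 1.25],
    "Customer ID": [13085.0, 16321.0, None, 13078.0],
})

issues = summarize_issues(toy_df)
print(issues)
```

Running the same helper on the full DataFrame would show how many rows each cleaning rule is going to affect.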

Cleaning Steps:

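# Work on a copy so the raw data stays untouched
cleaned_df = df.copy()

# Remove invalid invoices: keep only six-digit invoice numbers
cleaned_df = cleaned_df[cleaned_df["Invoice"].astype(str).str.match(r"^\d{6}$")]

# Remove non-standard stock codes: keep only five-digit codes
cleaned_df = cleaned_df[cleaned_df["StockCode"].astype(str).str.match(r"^\d{5}$")]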

  • ✅ 27% of records were removed after cleaning to ensure data accuracy.
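The snippet above covers invoices and stock codes. How missing Customer IDs and negative values were handled is not shown, so the following is only one possible approach: a sketch that assumes rows with a missing Customer ID or a non-positive Quantity or Price should be dropped (the post only mentions negative values), and that reports the share of rows removed. The helper name drop_invalid_rows and the toy data are hypothetical.

```python
import pandas as pd

def drop_invalid_rows(df: pd.DataFrame) -> pd.DataFrame:
    """Drop rows without a customer and rows with non-positive quantity or price."""
    out = df.dropna(subset=["Customer ID"])
    # Assumption: zero or negative quantities/prices are not real sales
    out = out[(out["Quantity"] > 0) & (out["Price"] > 0)]
    return out.copy()

# Toy rows standing in for the real transactions (made-up values)
toy_df = pd.DataFrame({
    "Customer ID": [13085.0, None, 13078.0, 15362.0],
    "Quantity": [12, 3, -6, 24],
    "Price": [6.95, 2.10, 2.95, 1.25],
})

toy_clean = drop_invalid_rows(toy_df)

# Share of rows removed, the same kind of figure as the percentage quoted above
removed_share = 1 - len(toy_clean) / len(toy_df)
print(f"Removed {removed_share:.0%} of rows")
```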

🔎 Feature Engineering: RFM Metrics

RFM analysis categorizes customers based on:

  • Recency (R): Days since last purchase
  • Frequency (F): Number of purchases
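  • Monetary (M): Total spending

# Line total per row; assumed to be Quantity * Price (skip if SalesLineTotal already exists)
cleaned_df = cleaned_df.assign(SalesLineTotal=cleaned_df["Quantity"] * cleaned_df["Price"])

# One row per customer: total spend, number of distinct invoices, latest purchase date
aggregated_df = cleaned_df.groupby("Customer ID").agg(
    MonetaryValue=("SalesLineTotal", "sum"),
    Frequency=("Invoice", "nunique"),
    LastInvoiceDate=("InvoiceDate", "max")
)

# Recency: days between each customer's last purchase and the latest purchase in the data
max_invoice_date = aggregated_df["LastInvoiceDate"].max()
aggregated_df["Recency"] = (max_invoice_date - aggregated_df["LastInvoiceDate"]).dt.days

  • ✅ RFM values help us group customers based on their buying behavior.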

📈 Data Visualization & Outlier Handling

Analyzing RFM Distributions

We plotted histograms and boxplots to understand the spread of Recency, Frequency, and Monetary values.

import matplotlib.pyplot as plt

# Histogram of total spend per customer
plt.hist(aggregated_df["MonetaryValue"], bins=10, color="skyblue", edgecolor="black")
plt.title("Monetary Value Distribution")
plt.xlabel("Monetary Value")
plt.ylabel("Count")
plt.show()

📌 Key Observations:

  • 🔹 Significant outliers exist in Frequency & Monetary values.
  • 🔹 Customers with extreme spending patterns need special treatment.
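Boxplots make these outliers easy to spot. The post mentions boxplots but only shows the histogram code, so this is a sketch of how they might be drawn with matplotlib, assuming the RFM column names used earlier. The function name plot_rfm_boxplots and the toy values are hypothetical.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line in a notebook
import matplotlib.pyplot as plt
import pandas as pd

def plot_rfm_boxplots(rfm: pd.DataFrame):
    """Draw one boxplot per RFM metric and return the figure."""
    columns = ["Recency", "Frequency", "MonetaryValue"]
    fig, axes = plt.subplots(1, len(columns), figsize=(12, 4))
    for ax, col in zip(axes, columns):
        ax.boxplot(rfm[col].dropna().to_numpy())
        ax.set_title(f"{col} Boxplot")
        ax.set_xticks([])  # a single box needs no x tick
    fig.tight_layout()
    return fig

# Toy RFM table (made-up values) with one obvious big spender
toy_rfm = pd.DataFrame({
    "Recency": [5, 40, 120, 300, 12],
    "Frequency": [10, 3, 1, 1, 25],
    "MonetaryValue": [850.0, 210.0, 45.0, 30.0, 9800.0],
})

fig = plot_rfm_boxplots(toy_rfm)
# In an interactive session, call plt.show() to display the figure
```

Points drawn beyond the whiskers are the candidates for the outlier handling described next.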

Outlier Handling Using IQR

# Interquartile range of total spend per customer
Q1 = aggregated_df["MonetaryValue"].quantile(0.25)
Q3 = aggregated_df["MonetaryValue"].quantile(0.75)
IQR = Q3 - Q1

# Keep customers at or below the upper fence (Q3 + 1.5 * IQR)
filtered_df = aggregated_df[aggregated_df["MonetaryValue"] <= Q3 + 1.5 * IQR]

  • ✅ Extreme spenders were categorized separately for better insights.
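The filter above keeps only the regular customers. To keep the extreme spenders in their own table rather than discard them, the same upper fence can be used to split the data in two. This is a sketch under that assumption. The helper name split_upper_outliers and the toy values are hypothetical, and the same helper could be applied to Frequency.

```python
import pandas as pd

def split_upper_outliers(df: pd.DataFrame, column: str):
    """Split df into (regular, extreme) rows using the Q3 + 1.5 * IQR upper fence."""
    q1 = df[column].quantile(0.25)
    q3 = df[column].quantile(0.75)
    upper_fence = q3 + 1.5 * (q3 - q1)
    is_extreme = df[column] > upper_fence
    return df[~is_extreme].copy(), df[is_extreme].copy()

# Toy RFM table (made-up values) with one very large spender
toy_rfm = pd.DataFrame({
    "Customer ID": [1, 2, 3, 4, 5, 6],
    "MonetaryValue": [100.0, 120.0, 130.0, 150.0, 160.0, 5000.0],
})

regular_df, extreme_df = split_upper_outliers(toy_rfm, "MonetaryValue")
print(extreme_df["Customer ID"].tolist())  # customers above the upper fence
```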

⚡ K-Means Clustering for Customer Segmentation

Finding the Optimal K (Elbow & Silhouette Method)

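We applied K-Means clustering to segment customers. The Elbow method and the Silhouette score helped us determine the ideal number of clusters.

from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# scaled_data: the scaled RFM features (not shown in this post)
k_values = range(2, 12)
inertias = []            # within-cluster sum of squares, for the elbow plot
silhouette_scores = []

for k in k_values:
    kmeans = KMeans(n_clusters=k, random_state=42)
    labels = kmeans.fit_predict(scaled_data)
    inertias.append(kmeans.inertia_)
    silhouette_scores.append(silhouette_score(scaled_data, labels))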
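The loop above takes scaled_data as given. K-Means is distance-based, so features measured in days, counts, and currency are usually standardized first. The scaling step is not shown in this post, so the following is a minimal sketch that assumes scikit-learn's StandardScaler applied to the three RFM columns. The toy values are made up.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Toy stand-in for the outlier-filtered RFM table (made-up values)
toy_rfm = pd.DataFrame({
    "Recency": [5, 40, 120, 300],
    "Frequency": [10, 3, 1, 1],
    "MonetaryValue": [850.0, 210.0, 45.0, 30.0],
})

# Standardize each column to zero mean and unit variance
scaler = StandardScaler()
scaled_data = scaler.fit_transform(toy_rfm[["Recency", "Frequency", "MonetaryValue"]])

print(scaled_data.shape)  # one row per customer, one column per RFM metric
```

Without this step, MonetaryValue would dominate the distance calculation simply because its numbers are larger.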

  • ✅ K = 4 was selected as the optimal number of clusters.

🚀 Customer Segments & Business Insights

After clustering, we analyzed the customer groups and their business implications.
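One way to do this analysis is to fit the final model with K = 4 and compare the average RFM values per cluster. The sketch below assumes a scikit-learn pipeline of StandardScaler and KMeans and a column named Cluster for the labels; both are illustrative choices, not necessarily what the project used. The customers are random synthetic data, so the resulting clusters carry no business meaning.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic RFM table (random values, for illustration only)
rng = np.random.default_rng(42)
toy_rfm = pd.DataFrame({
    "Recency": rng.integers(0, 365, size=200),
    "Frequency": rng.integers(1, 30, size=200),
    "MonetaryValue": rng.gamma(shape=2.0, scale=300.0, size=200),
})

features = ["Recency", "Frequency", "MonetaryValue"]

# Scale the features, then cluster with K = 4
model = make_pipeline(StandardScaler(), KMeans(n_clusters=4, random_state=42, n_init=10))
toy_rfm["Cluster"] = model.fit_predict(toy_rfm[features])

# Average RFM values and customer count per cluster, the basis for naming segments
profile = toy_rfm.groupby("Cluster")[features].mean()
profile["Customers"] = toy_rfm["Cluster"].value_counts().sort_index()
print(profile.round(1))
```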

Cluster | Label                 | Characteristics                   | Business Strategy
0       | Loyal Retainers       | High-value, frequent buyers       | Retention programs, exclusive discounts
1       | Re-Engagement Targets | Infrequent buyers, lower spending | Re-targeting ads, special promotions
2       | Growth Potential      | Recent buyers, low spending       | Upsell strategies, better recommendations
3       | Top Performers        | High-frequency, high-value buyers | VIP programs, premium services
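The labels in the table can be attached to each customer with a simple lookup. This is a minimal sketch, assuming the cluster ids are stored in a column named Cluster (a hypothetical name). The customers shown are made up.

```python
import pandas as pd

# Segment names taken from the table above, keyed by cluster id
segment_names = {
    0: "Loyal Retainers",
    1: "Re-Engagement Targets",
    2: "Growth Potential",
    3: "Top Performers",
}

# Toy cluster assignments (made-up customers)
toy_df = pd.DataFrame({
    "Customer ID": [13085, 13078, 15362, 18102],
    "Cluster": [3, 0, 1, 2],
})

# Translate numeric cluster ids into readable segment labels
toy_df["Segment"] = toy_df["Cluster"].map(segment_names)
print(toy_df)
```

Keeping the mapping in one dictionary makes it easy to update if the clusters are re-fit and the ids change.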

📌 Key Actions:

  • ✔ Offer personalized marketing for Loyal Retainers & Top Performers
  • ✔ Use discount strategies to re-engage dormant customers
  • ✔ Implement recommendation engines for Growth Potential customers

📌 Conclusion

🎯 Takeaways:

  • RFM analysis effectively segments customers based on behavior.
  • K-Means clustering identifies distinct customer groups for better engagement.
  • Business insights help improve marketing & retention strategies.

🚀 Future Work:

  • 🔹 Try hierarchical clustering and compare its segments with the K-Means result.
  • 🔹 Integrate predictive modeling for dynamic customer targeting.

🔗 Check out the full project on GitHub:

👉 GitHub Repository
