Skip to content

Latest commit

 

History

History
309 lines (235 loc) · 17.8 KB

File metadata and controls

309 lines (235 loc) · 17.8 KB

🛒 Customer Segmentation with RFM Analysis - SuperStore - Retail by using Python

image

Author: Loc Ha
Date: 2025-04-25

🛠 Tools Used

Python Pandas Matplotlib Seaborn


📑 Table of Contents

I. 📌 Background & Overview
II. 📂 Dataset Description & Data Structure
III. ⚒️ Main Process
IV. 📊 Key Insights & Recommendations


📌 Background & Overview

🏢 Business Context

SuperStore is a fictional retail company based in the United States, specializing in all-occasion gifts.
During the Christmas and New Year season, the Marketing team plans to launch loyalty and promotional campaigns. Manual customer segmentation with Excel is no longer feasible at scale.

🎯 Objective

This project applies RFM (Recency – Frequency – Monetary) analysis to segment customers of SuperStore, aiming to identify valuable customer groups and provide actionable insights for targeted marketing and retention campaigns.

❓ Business Questions:

  • Which customer groups are most valuable to retain?
  • What segments show high churn risk?
  • How can the company increase purchase frequency and customer lifetime value?

📂 Dataset Description & Data Structure

🗂 Dataset Description

  • Source: Online Retail Transaction Dataset (UK-based non-store online retail)
  • Period: 01/12/2010 – 09/12/2011
  • Size: ~500K transactions
  • Format: .csv

🧩 Data Structure

📊 Table Schema (Simplified)

Column Name Description Data Type
InvoiceNo Transaction ID. If starts with 'C' → cancel String/Int
StockCode Product code, unique String/Int
Description Product name String
Quantity Quantity per transaction Int
InvoiceDate Date and time of transaction DateTime
UnitPrice Price per unit (GBP) Float
CustomerID Unique customer ID Int

⚒️ Main Process

🔄 Main Process

  1. Data Preparation: Handle nulls, clean invalid records, convert types.
📌 View Python code for Data Preparation
# EXPLORE DATA
print(main_data.info())
print(main_data.describe())
print(main_data.isnull().sum())

# CORRECT DATA TYPES
main_data['CustomerID'] = main_data['CustomerID'].astype('Int64')
print(main_data['CustomerID'].dtype)
main_data['InvoiceDate'] = pd.to_datetime(main_data['InvoiceDate'], errors='coerce')

# HANDLE NULLS
main_data = main_data[main_data['CustomerID'].notnull()]

# FILTER DATA
main_data = main_data[(main_data['Quantity'] > 0) & (main_data['UnitPrice'] > 0)]
main_data = main_data[
    main_data['StockCode'].str.match(r'^[A-Za-z0-9]{4,7}$') &
    main_data['Description'].str.contains(
        'POSTAGE|Bank Charges|CARRIAGE|Next Day Carriage|Manual',
        case=False, na=False
    )
]
  1. Feature Engineering: Create TotalPrice = Quantity × UnitPrice.
main_data['TotalPrice'] = main_data['Quantity'] * main_data['UnitPrice']
  1. RFM Calculation: Compute Recency, Frequency, Monetary per customer.
📌 View Python code for RFM Calculation
# Reference date for recency
snapshot_date = pd.to_datetime('2011-12-31')

# Recency (days since last purchase)
recency = main_data.groupby('CustomerID')['InvoiceDate'].max().reset_index()
recency['Recency'] = (snapshot_date - recency['InvoiceDate']).dt.days

# Frequency (unique invoices)
frequency = main_data.groupby('CustomerID')['InvoiceNo'].nunique().reset_index()
frequency.columns = ['CustomerID', 'Frequency']

# Monetary (total spend)
monetary = main_data.groupby('CustomerID')['TotalPrice'].sum().reset_index()
monetary.columns = ['CustomerID', 'Monetary']

# Combine
rfm = recency[['CustomerID', 'Recency']].merge(frequency, on='CustomerID').merge(monetary, on='CustomerID')
print(rfm.head())
  1. Scoring: Assign quintiles (1–5) for R, F, M and build RFM Score.
rfm['R_Score'] = pd.qcut(rfm['Recency'], 5, labels=[5,4,3,2,1])             # lower recency → higher score
rfm['F_Score'] = pd.qcut(rfm['Frequency'].rank(method='first'), 5, labels=[1,2,3,4,5])
rfm['M_Score'] = pd.qcut(rfm['Monetary'], 5, labels=[1,2,3,4,5])
rfm['RFM_Score'] = rfm['R_Score'].astype(str) + rfm['F_Score'].astype(str) + rfm['M_Score'].astype(str)
  1. Segmentation: Map scores to segments (Champions, Loyal, At Risk, etc.).
📌 View Python code for RFM Segmentation
segment_map = {
    'Champions': ['555','554','544','545','454','455','445'],
    'Loyal': ['543','444','435','355','354','345','344','335'],
    'Potential Loyalist': ['553','551','552','541','542','533','532','531','452','451','442','441','431','453','433','432','423','353','352','351','342','341','333','323'],
    'New Customers': ['512','511','422','421','412','411','311'],
    'Promising': ['525','524','523','522','521','515','514','513','425','424','413','414','415','315','314','313'],
    'Need Attention': ['535','534','443','434','343','334','325','324'],
    'About To Sleep': ['331','321','312','221','213','231','241','251'],
    'At Risk': ['255','254','245','244','253','252','243','242','235','234','225','224','153','152','145','143','142','135','134','133','125','124'],
    'Cannot Lose Them': ['155','154','144','214','215','115','114','113'],
    'Hibernating customers': ['332','322','233','232','223','222','132','123','122','212','211'],
}
def assign_segment(rfm_score):
    for segment, scores in segment_map.items():
        if rfm_score in scores:
            return segment
    return 'Lost customers'

rfm['Segment'] = rfm['RFM_Score'].apply(assign_segment)
print(rfm.head())
  1. Visualization: Histograms, treemap, bar charts, boxplots.
📌 View Python code for Visualization
# Histograms
fig, ax = plt.subplots(figsize=(12, 3)); sns.histplot(rfm['Recency'], kde=True); ax.set_title('Distribution of Recency'); plt.show()
fig, ax = plt.subplots(figsize=(12, 3)); sns.histplot(rfm['Frequency'], kde=True); ax.set_title('Distribution of Frequency'); plt.show()
fig, ax = plt.subplots(figsize=(12, 3)); sns.histplot(rfm['Monetary'], kde=True); ax.set_title('Distribution of Monetary'); plt.show()

# Treemap of segment share
temp_rfm = rfm.groupby('Segment')['CustomerID'].count().reset_index()
temp_rfm.columns = ['Segment','Cust_count']
temp_rfm['Count_share'] = temp_rfm['Cust_count'] / temp_rfm['Cust_count'].sum()

import squarify
colors = ['#FF0000','#00FFFF','#FFFFAA','#A52A2A','#800080','#00FF00','#808000','#FF0CB3','#FFA500','#FF00FF','#736F6E']
fig, ax = plt.subplots(1, figsize=(15,8))
squarify.plot(
    sizes=temp_rfm['Cust_count'],
    label=temp_rfm['Segment'],
    value=[f'{x*100:.2f}%' for x in temp_rfm['Count_share']],
    alpha=.8, color=colors,
    bar_kwargs=dict(linewidth=1.5, edgecolor="White")
)
plt.title('RFM Segments of Customer Count', fontsize=16); plt.axis('off'); plt.show()

# Countplot by segment
fig, ax = plt.subplots(figsize=(12, 6))
sns.countplot(data=rfm, x='Segment', order=rfm['Segment'].value_counts().index, palette='viridis')
plt.title('Number of Customers in Each Segment'); plt.xticks(rotation=45); plt.xlabel('Segment'); plt.ylabel('Number of Customers'); plt.show()

# Boxplots R/F/M by segment
fig, axes = plt.subplots(3, 1, figsize=(15, 18))
sns.boxplot(x='Segment', y='Recency',   data=rfm.sort_values('Segment'), palette='viridis', ax=axes[0]); axes[0].set_title('Recency Distribution by Customer Segment');   axes[0].set_xlabel('Customer Segment'); axes[0].set_ylabel('Recency (Days)'); axes[0].tick_params(axis='x', rotation=45)
sns.boxplot(x='Segment', y='Frequency', data=rfm.sort_values('Segment'), palette='viridis', ax=axes[1]); axes[1].set_title('Frequency Distribution by Customer Segment'); axes[1].set_xlabel('Customer Segment'); axes[1].set_ylabel('Frequency (Number of Orders)'); axes[1].tick_params(axis='x', rotation=45)
sns.boxplot(x='Segment', y='Monetary',  data=rfm.sort_values('Segment'), palette='viridis', ax=axes[2]); axes[2].set_title('Monetary Distribution by Customer Segment');  axes[2].set_xlabel('Customer Segment'); axes[2].set_ylabel('Monetary (Total Price)'); axes[2].tick_params(axis='x', rotation=45)
plt.tight_layout(); plt.show()

📈 Key insight from each chart

Distribution of Recency

image
  • Most customers made a purchase within the last 120 days (~4 months), and the number of customers steadily decreases as the recency period increases.
  • This is a positive signal indicating that the majority of customers are still active and engaged.
  • 💡Recommendation: Focus customer appreciation campaigns on those with low Recency to strengthen loyalty and encourage repeat purchases.

Distribution of Frequency

image
  • During the dataset period, purchase frequency is concentrated at 0–5 times, with very few customers buying more than 5 times and only a handful exceeding 10 times.
  • This indicates that most customers are low-frequency buyers, and repeat purchases are relatively rare.
  • 💡 Recommendation: Encourage customers to increase purchase frequency through loyalty programs such as points accumulation, free shipping for low-value orders, or vouchers for subsequent purchases.

Distribution of Monetary

image
  • Most customers have a Customer Lifetime Value (CLV) below $1000, with the majority concentrated under $500.

  • Beyond $1000, the number of customers drops sharply, and very few exceed higher spending thresholds.

  • 💡 Recommendations:

    • $1–$500: Build shopping habits through discount vouchers for next orders or welcome combos for new customers.
    • $500–$1000: Encourage higher spending with tiered loyalty levels (Silver, Gold, Platinum) and offer bonus points with expiration dates to drive usage.
    • >$1000: Promote premium products with exclusive perks and focus on retaining these customers, nurturing them toward VIP level >$4000.
    • >$4000 (VIPs): Provide VIP/black cards, exclusive personal benefits (dedicated account manager, personalized gifts on birthdays/holidays).
    • One $8000 CLV customer: Requires special care and attention with personalized services and unique rewards.

Customer Segments Analysis (Recency – Frequency – Monetary)

image
Segment Current Status (R/F/M) 💡 Recommendation
Champions Recency <50 days, high Frequency & Monetary (some $1000–3000+) VIP care: loyalty cards, exclusive offers, or personal services.
Loyal Recency ~100 days, medium Frequency & Monetary Maintain benefits, nurture to increase CLV.
Potential Loyalists Recency ~50–100 days, medium Frequency & Monetary, some with high intent Reward consistent purchases with points, aim to convert to Loyal.
Promising Recency ~50–100 days, moderate Frequency & Monetary Upselling with combos and loyalty points to boost CLV.
New Customers Recency ~50–100 days, first-time buyers, small spend Onboarding guides, bundles, discount voucher for next purchase.
Need Attention Recency ~100 days, lower engagement than Loyal Special offers and programs to convert them into Loyal customers.
About To Sleep Avg. Recency ~100 days, very low Frequency & Monetary Send reminder emails with discounts to reactivate purchases.
At Risk Recency ~150–250 days, few transactions, low spend Free gifts with repurchase within 30 days; recommend products from past history.
Hibernating Recency ~100–350 days, declining Frequency & Monetary Reactivation campaigns with strong vouchers and personalized suggestions.
Cannot Lose Them Recency ~200–300 days, low Frequency & low Monetary Provide vouchers/free gifts for purchases within 30 days to win them back.
Lost Customers Recency ~250–350 days, very low Frequency & Monetary Final attempt: 25–50% discount or free products to bring them back.

📊 Company Status & Strategic Priorities

image image
Segment Insights % of Customers Approx. Count 💡 Strategic Recommendation
Champions – Strong loyalty, high frequency & spend 19.7% ~65 🟢 Priority 1: Maintain & grow with VIP care, exclusive offers, loyalty programs
Hibernating – Once active, now disengaged 11.0% ~36 🟡 Priority 2: Reactivation campaigns with 20–30% vouchers, personalized offers
Lost Customers – Long inactivity, low frequency & spend 9.3% ~31 🟡 Priority 2: Attempt win-back with strong discounts or free product trials
Potential Loyalists – Decent spend, medium frequency, growth opportunity 11.4% ~38 🟡 Priority 3: Encourage repeat purchases with points & loyalty tiers
Promising – Recently active, moderate spend 5.1% ~17 🟡 Priority 3: Upselling with bundles, reward points to increase CLV
About To Sleep – Avg. recency ~100 days, very low activity 10.5% ~35 🔴 Priority 4: Send reminders & discounts to prevent churn
At Risk – High recency (150–250 days), low activity 10.1% ~34 🔴 Priority 4: Free gifts or special offers within 30 days to reactivate
New Customers – First-time buyers, low spend 5.1% ~17 🟡 Priority 5: Onboarding guides, discounts to build habits
Loyal – Stable customers, medium frequency & spend 8.1% ~27 🟢 Ongoing: Maintain benefits, upsell premium products
Cannot Lose Them – Very few but critical customers 3.9% ~13 🔴 Immediate care: Exclusive vouchers/gifts to avoid churn

🚀 Strategic Priority Order

  1. 🟢 Maintain & expand Champions
  2. 🟡 Reactivate Hibernating & Lost Customers
  3. 🟡 Develop Potential Loyalists & Promising
  4. 🔴 Prevent churn in About To Sleep & At Risk
  5. 🟡 Support New Customers to form purchase habits

📊 Key Insights & Recommendations

💡 Key Insights

  • Recency: Most customers purchased recently (<120 days), but ~30% fall into At Risk, About To Sleep, Hibernating, Lost.
  • Frequency: Majority purchase ≤5 times; very few are frequent buyers.
  • Monetary: CLV mostly < £500, only a handful > £1000.
  • Segments: Champions (19.7%) and Loyal customers are strong, but churn-prone groups are sizable.

📝 Recommendations

  1. Retain Champions: Exclusive rewards and personal care.
  2. Reactivation Campaigns: Discounts for Hibernating & Lost.
  3. Grow Potential Loyalists: Incentives for repeat purchases.
  4. Address At Risk: Reminder emails, promotions, gifts.
  5. Nurture New Customers: Onboarding and vouchers for retention.