Credit Risk Probability Model

Executive Summary

This project delivers an end-to-end credit scoring system for Bati Bank’s Buy-Now-Pay-Later (BNPL) program in partnership with an eCommerce platform. It transforms transactional behavior data into risk signals using interpretable and auditable machine learning models in compliance with Basel II. The solution includes feature engineering, proxy target creation, model training and evaluation, deployment via FastAPI, and automated CI/CD using GitHub Actions and Docker.


Table of Contents

  1. Business Problem
  2. Credit Scoring Business Understanding
  3. Data Overview
  4. Feature Engineering
  5. Proxy Target Variable
  6. Modeling and Experiment Tracking
  7. Evaluation Metrics
  8. Model Deployment and API
  9. Testing and CI/CD Pipeline
  10. Installation
  11. Usage
  12. Contributing
  13. License
  14. Acknowledgments

1. Business Problem

Bati Bank aims to launch a BNPL offering via an eCommerce platform. To minimize credit risk and comply with Basel II regulations, a credit scoring model must be developed to assess customer risk based on behavioral transaction data, despite the absence of traditional credit history or default labels.


2. Credit Scoring Business Understanding

2.1 Basel II and Interpretability

Basel II requires banks to use risk-sensitive capital measurement systems. This favors interpretable, auditable, and validated models. Techniques such as Logistic Regression with Weight of Evidence (WoE) are preferred due to their transparency and ease of communication with regulators.
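To make the WoE approach concrete, here is a minimal sketch of how Weight of Evidence is computed per bin of a feature, treating label 1 as "bad". The bin names and counts are illustrative, not project data:

```python
import math

def weight_of_evidence(bins):
    """Compute WoE per bin from {bin_name: (good_count, bad_count)}.

    WoE = ln(distribution_of_good / distribution_of_bad); positive WoE
    means the bin is dominated by good outcomes, negative by bad ones.
    """
    total_good = sum(g for g, b in bins.values())
    total_bad = sum(b for g, b in bins.values())
    woe = {}
    for name, (good, bad) in bins.items():
        dist_good = good / total_good
        dist_bad = bad / total_bad
        woe[name] = math.log(dist_good / dist_bad)
    return woe

# Example: three hypothetical recency bins with (good, bad) counts.
woe = weight_of_evidence({
    "0-30d": (400, 20),
    "31-90d": (300, 60),
    "90d+": (100, 120),
})
```

Because each bin's WoE is a single monotone number, the resulting logistic-regression coefficients are straightforward to explain to a regulator.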

2.2 Proxy Variables in Place of Defaults

In the absence of true default labels, proxy variables (e.g., RFM behavioral segments) are necessary for supervised learning. However, they carry risks:

  • Label Risk: Poor proxies lead to misleading predictions.
  • Bias Risk: Behavioral proxies may encode socio-economic or demographic bias.
  • Regulatory Risk: Decisions based on proxies must be well-documented and justified.

2.3 Simple vs. Complex Models

| Feature | Simple Model (LogReg + WoE) | Complex Model (e.g., XGBoost) |
|---|---|---|
| Interpretability | ✅ High | ❌ Low (needs SHAP/LIME) |
| Regulatory Acceptance | ✅ Preferred | ⚠️ Requires explainability |
| Accuracy | ⚠️ Moderate | ✅ High |
| Deployment Simplicity | ✅ Easy | ⚠️ More involved |

We prioritize interpretability and use complex models only when explanations (e.g., SHAP) are provided.


3. Data Overview

3.1 Project Structure

credit-risk-model/
├── .github/workflows/ci.yml   # CI/CD configuration
├── data/                      # Data storage (add to .gitignore)
│   ├── raw/                   # Raw data
│   └── processed/             # Processed data for training
├── notebooks/                 # Jupyter notebooks for analysis
│   ├── task 1 and 2/          # Exploratory data analysis
│   │   └── load_EDA.ipynb
│   ├── task 3/                # Feature engineering
│   │   └── feature-engineering.ipynb
│   ├── task 4/                # RFM metrics
│   │   └── RFMmetrics.ipynb
│   └── task 5/                # Modeling
│       └── modeling.ipynb
├── src/                       # Source code
│   ├── __init__.py
│   ├── load.py
│   ├── RFMmetrics.py
│   ├── saveFile.py
│   ├── visualization.py
│   ├── PreProcessing.py       # Feature engineering script
│   ├── train.py               # Model training script
│   ├── predict.py             # Inference script
│   └── api/
│       ├── main.py            # FastAPI application
│       └── pydantic_models.py # Pydantic models for the API
├── tests/                     # Unit tests
│   └── test_api.py
├── Dockerfile                 # Docker configuration
├── docker-compose.yml         # Docker Compose configuration
├── requirements.txt           # Project dependencies
├── .gitignore                 # Git ignore file
├── README.md                  # Project documentation
├── LICENSE                    # Project license
└── register.py                # Model registration script

3.2 Data Dictionary

| Column | Description |
|---|---|
| TransactionId | Unique transaction identifier |
| AccountId | Unique customer identifier |
| CustomerId | Shared ID for customer |
| Amount / Value | Transaction value (debit/credit) |
| ChannelId | Platform used (web, Android, iOS) |
| ProductCategory | Grouped product type |
| FraudResult | Fraud label (1: fraud, 0: normal) |
| ... | Other behavioral and demographic features |

Additional engineered features include RFM metrics and time-based aggregates.


4. Feature Engineering

We applied the following techniques:

  • Aggregate Features:

    • Total transaction count
    • Total/avg/std of transaction values
    • Transaction recency metrics
  • Extracted Features:

    • Hour, day, month of transaction
    • Transaction patterns across time
  • Encoding:

    • One-Hot Encoding for nominal features
    • WoE encoding for regulatory explainability
  • Handling Missing Values:

    • Imputation with median/most frequent
    • Removal when necessary
  • Scaling:

    • StandardScaler for numerical inputs
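The aggregation and extraction steps above can be sketched with pandas. The frame, its values, and the `TransactionStartTime` column name are illustrative assumptions, not the project's actual data:

```python
import pandas as pd

# Hypothetical transactions; column names follow the data dictionary above.
tx = pd.DataFrame({
    "CustomerId": ["C1", "C1", "C2"],
    "Amount": [100.0, 50.0, 300.0],
    "TransactionStartTime": pd.to_datetime(
        ["2024-01-01 08:00", "2024-01-15 16:00", "2024-01-10 12:00"]),
})

# Extracted time-based feature: hour of transaction.
tx["Transaction_Hour"] = tx["TransactionStartTime"].dt.hour

# Aggregate features per customer: count, total, and average amount.
agg = tx.groupby("CustomerId")["Amount"].agg(
    total_amount="sum", avg_amount="mean", txn_count="count")
```

The resulting per-customer frame can then be joined back onto the transaction-level features before encoding and scaling.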

5. Proxy Target Variable

Since no "default" column exists:

  1. RFM Metrics were calculated per customer
  2. K-Means Clustering (k=3) was applied on scaled RFM values
  3. The least engaged cluster (low frequency & monetary value) was labeled as is_high_risk = 1
  4. All other clusters were labeled 0

This binary proxy was added back to the dataset for supervised learning.
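The four steps above can be sketched as follows; the RFM matrix (columns: recency in days, frequency, monetary) is an illustrative toy example, and `n_init` / `random_state` values are assumptions for reproducibility:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Hypothetical per-customer RFM matrix: [recency_days, frequency, monetary].
rfm = np.array([
    [5, 40, 5000.0],
    [7, 35, 4200.0],
    [60, 3, 150.0],
    [75, 2, 90.0],
    [20, 15, 1200.0],
    [25, 12, 1000.0],
])

# Steps 1-2: scale the RFM values and cluster with k=3.
scaled = StandardScaler().fit_transform(rfm)
km = KMeans(n_clusters=3, random_state=42, n_init=10).fit(scaled)

# Step 3: the least engaged cluster (lowest mean frequency/monetary)
# becomes the high-risk proxy; step 4: everyone else is labeled 0.
engagement = [rfm[km.labels_ == c][:, 1:].mean() for c in range(3)]
high_risk_cluster = int(np.argmin(engagement))
is_high_risk = (km.labels_ == high_risk_cluster).astype(int)
```

Scaling before clustering matters here: without it, the monetary column would dominate the Euclidean distances.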


6. Modeling and Experiment Tracking

  • Models Trained:

    • Logistic Regression (with WoE)
    • Decision Tree
    • Random Forest
    • Gradient Boosting
  • Experiment Tracking:

    • Tracked all runs with mlflow
    • Registered best model in the MLflow Model Registry
  • Training Pipeline:

    • sklearn.pipeline.Pipeline used for reproducibility
    • GridSearchCV for hyperparameter tuning
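A condensed sketch of the training pipeline on synthetic data; the grid over `C` and the scoring choice are illustrative, not the project's actual search space:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the engineered feature matrix.
X, y = make_classification(n_samples=200, n_features=5, random_state=42)

# Scaling and the estimator live in one Pipeline, so cross-validation
# refits the scaler per fold and the whole object is reproducible.
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
])

grid = GridSearchCV(pipe, {"clf__C": [0.1, 1.0, 10.0]}, cv=3, scoring="f1")
grid.fit(X, y)
```

The fitted `grid.best_estimator_` is what would be logged to MLflow and promoted in the Model Registry.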

7. Evaluation Metrics

| Metric | Meaning |
|---|---|
| Accuracy | Share of all predictions that are correct |
| Precision | Share of predicted positives that are actually positive |
| Recall | Share of actual positives that are detected |
| F1-Score | Harmonic mean of precision and recall |
| ROC AUC | Model's ability to separate the two classes |

The final model was chosen based on best F1-score and ROC AUC.
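A small worked example of these metrics on illustrative labels and scores (a 0.5 decision threshold is assumed):

```python
from sklearn.metrics import f1_score, precision_score, recall_score, roc_auc_score

y_true = [0, 0, 1, 1, 1, 0]          # toy ground-truth labels
y_prob = [0.1, 0.4, 0.8, 0.35, 0.9, 0.2]  # toy predicted probabilities
y_pred = [1 if p >= 0.5 else 0 for p in y_prob]

precision = precision_score(y_true, y_pred)  # correct among predicted positives
recall = recall_score(y_true, y_pred)        # detected among actual positives
f1 = f1_score(y_true, y_pred)                # harmonic mean of the two
auc = roc_auc_score(y_true, y_prob)          # threshold-free ranking quality
```

Note that ROC AUC is computed from the raw probabilities, not the thresholded predictions, which is why it complements F1 when choosing the final model.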


8. Model Deployment and API

The model is deployed using FastAPI with Docker support.

🛠️ Endpoint: /predict

Request Format

{
  "Recency": 14,
  "Frequency": 5,
  "Monetary": 1200,
  "Transaction_Hour": 16
}

Response

{
  "risk_probability": 0.73
}

Deployed using:

  • FastAPI
  • Uvicorn
  • Docker
  • MLflow model loading
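A stdlib-only client sketch for the endpoint; it assumes the service is running locally on port 8000 and that the request/response bodies match the shapes shown above (the network call itself is left commented out):

```python
import json
from urllib import request

API_URL = "http://localhost:8000/predict"  # assumes a locally running service

# Feature payload matching the request schema shown above.
payload = {"Recency": 14, "Frequency": 5, "Monetary": 1200, "Transaction_Hour": 16}

def predict(features: dict, url: str = API_URL) -> float:
    """POST the feature payload and return the predicted risk probability."""
    req = request.Request(
        url,
        data=json.dumps(features).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with request.urlopen(req) as resp:
        return json.load(resp)["risk_probability"]

# With the API up: predict(payload) returns a float in [0, 1].
```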

9. Testing and CI/CD Pipeline

  • ✅ Unit tests with pytest (tests/ folder)
  • ✅ Linting with flake8
  • ✅ GitHub Actions for Continuous Integration
# .github/workflows/ci.yml
jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - name: Install dependencies
        run: pip install -r requirements.txt
      - name: Run Linter
        run: flake8 .
      - name: Run Tests
        run: pytest

10. Installation

git clone <repository-url>
cd credit-risk-model
pip install -r requirements.txt

11. Usage

Run API locally

uvicorn src.api.main:app --reload

Visit: http://localhost:8000/docs for Swagger UI.


12. Contributing

Contributions are welcome!

  1. Fork the repo
  2. Create a new branch: git checkout -b feature-branch
  3. Make changes and commit: git commit -m 'Add feature'
  4. Push: git push origin feature-branch
  5. Open a Pull Request

13. License

This project is licensed under the Apache License 2.0.
See the LICENSE file for more details.


14. Acknowledgments

