This project delivers an end-to-end credit scoring system for Bati Bank's Buy-Now-Pay-Later (BNPL) program in partnership with an eCommerce platform. It transforms transactional behavior data into risk signals using interpretable and auditable machine learning models in compliance with Basel II. The solution includes feature engineering, proxy target creation, model training and evaluation, deployment via FastAPI, and automated CI/CD using GitHub Actions and Docker.
- Business Problem
- Credit Scoring Business Understanding
- Data Overview
- Feature Engineering
- Proxy Target Variable
- Modeling and Experiment Tracking
- Evaluation Metrics
- Model Deployment and API
- Testing and CI/CD Pipeline
- Installation
- Usage
- Contributing
- License
- Acknowledgments
Bati Bank aims to launch a BNPL offering via an eCommerce platform. To minimize credit risk and comply with Basel II regulations, a credit scoring model must be developed to assess customer risk based on behavioral transaction data, despite the absence of traditional credit history or default labels.
Basel II requires banks to use risk-sensitive capital measurement systems. This favors interpretable, auditable, and validated models. Techniques such as Logistic Regression with Weight of Evidence (WoE) are preferred due to their transparency and ease of communication with regulators.
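As an illustration of how WoE values are computed, here is a minimal pandas sketch on toy data (the column names and the convention that `1` marks the high-risk class are illustrative assumptions, not taken from the project code):

```python
import numpy as np
import pandas as pd

def weight_of_evidence(df, feature, target):
    """Compute Weight of Evidence per category of `feature`.

    WoE = ln( %good / %bad ), where target == 1 marks a high-risk ("bad") case.
    """
    grouped = df.groupby(feature)[target].agg(total="count", bad="sum")
    grouped["good"] = grouped["total"] - grouped["bad"]
    # Share of all goods / all bads falling in each category
    dist_good = grouped["good"] / grouped["good"].sum()
    dist_bad = grouped["bad"] / grouped["bad"].sum()
    grouped["woe"] = np.log(dist_good / dist_bad)
    return grouped

# Toy data: channel used vs. a high-risk flag
df = pd.DataFrame({
    "ChannelId": ["web", "web", "web", "android", "android", "android", "ios", "ios", "ios"],
    "is_high_risk": [0, 0, 1, 1, 1, 0, 0, 0, 1],
})
print(weight_of_evidence(df, "ChannelId", "is_high_risk")["woe"])
```

Positive WoE marks categories with a higher-than-average share of goods; a regulator can read the resulting scorecard category by category, which is exactly why this encoding pairs well with Logistic Regression.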
In the absence of true default labels, proxy variables (e.g., RFM behavioral segments) are necessary for supervised learning. However, they carry risks:
- Label Risk: Poor proxies lead to misleading predictions.
- Bias Risk: Behavioral proxies may encode socio-economic or demographic bias.
- Regulatory Risk: Decisions based on proxies must be well-documented and justified.
| Feature | Simple Model (LogReg + WoE) | Complex Model (e.g., XGBoost) |
|---|---|---|
| Interpretability | High | Low (needs SHAP/LIME) |
| Regulatory Acceptance | Preferred | Requires explainability |
| Accuracy | Moderate | High |
| Deployment Simplicity | Easy | More involved |
We prioritize interpretability and use complex models only when explanations (e.g., SHAP) are provided.
```
credit-risk-model/
├── .github/workflows/ci.yml      # CI/CD configuration
├── data/                         # Data storage (add to .gitignore)
│   ├── raw/                      # Raw data
│   └── processed/                # Processed data for training
├── notebooks/                    # Jupyter notebooks for analysis
│   ├── task 1 and 2/             # Exploratory data analysis
│   │   └── load_EDA.ipynb
│   ├── task 3/                   # Feature engineering
│   │   └── feature-engineering.ipynb
│   ├── task 4/                   # RFM metrics
│   │   └── RFMmetrics.ipynb
│   └── task 5/                   # Modeling
│       └── modeling.ipynb
├── src/                          # Source code
│   ├── __init__.py
│   ├── load.py
│   ├── RFMmetrics.py
│   ├── saveFile.py
│   ├── visualization.py
│   ├── PreProcessing.py          # Feature engineering script
│   ├── train.py                  # Model training script
│   ├── predict.py                # Inference script
│   └── api/
│       ├── main.py               # FastAPI application
│       └── pydantic_models.py    # Pydantic models for API
├── tests/                        # Unit tests
│   └── test_api.py
├── Dockerfile                    # Docker configuration
├── docker-compose.yml            # Docker Compose configuration
├── requirements.txt              # Project dependencies
├── .gitignore                    # Git ignore file
├── README.md                     # Project documentation
├── LICENSE                       # Project license
└── register.py                   # Model registration script
```
| Column | Description |
|---|---|
| TransactionId | Unique transaction identifier |
| AccountId | Unique customer identifier |
| CustomerId | Shared ID for customer |
| Amount / Value | Transaction value (debit/credit) |
| ChannelId | Platform used (web, Android, iOS) |
| ProductCategory | Grouped product type |
| FraudResult | Fraud label (1: fraud, 0: normal) |
| ... | Other behavioral and demographic features |
Additional engineered features include RFM metrics and time-based aggregates.
We applied the following techniques:
- **Aggregate Features:**
  - Total transaction count
  - Total/avg/std of transaction values
  - Transaction recency metrics
- **Extracted Features:**
  - Hour, day, month of transaction
  - Transaction patterns across time
- **Encoding:**
  - One-Hot Encoding for nominal features
  - WoE encoding for regulatory explainability
- **Handling Missing Values:**
  - Imputation with median/most frequent
  - Removal when necessary
- **Scaling:**
  - StandardScaler for numerical inputs
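The aggregate and time-based features above can be sketched with a `groupby` over the transaction log. This is a minimal pandas example on toy data; the timestamp column name (`TransactionStartTime`) and the aggregate names are illustrative assumptions:

```python
import pandas as pd

# Toy transaction log using columns named in the data overview above
tx = pd.DataFrame({
    "AccountId": ["A1", "A1", "A2", "A2", "A2"],
    "Amount": [100.0, 250.0, 40.0, 60.0, 75.0],
    "TransactionStartTime": pd.to_datetime([
        "2019-01-05 09:00", "2019-01-20 16:30",
        "2019-01-02 11:00", "2019-01-15 14:00", "2019-01-28 20:00",
    ]),
})

# Time-based extracts: hour, day, month of each transaction
tx["Transaction_Hour"] = tx["TransactionStartTime"].dt.hour
tx["Transaction_Day"] = tx["TransactionStartTime"].dt.day
tx["Transaction_Month"] = tx["TransactionStartTime"].dt.month

# Per-customer aggregates: count, total/avg/std of value, and recency
snapshot = tx["TransactionStartTime"].max()
agg = tx.groupby("AccountId").agg(
    tx_count=("Amount", "count"),
    tx_total=("Amount", "sum"),
    tx_mean=("Amount", "mean"),
    tx_std=("Amount", "std"),
    last_tx=("TransactionStartTime", "max"),
)
agg["recency_days"] = (snapshot - agg["last_tx"]).dt.days
print(agg[["tx_count", "tx_total", "recency_days"]])
```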
Since no "default" column exists:
- RFM Metrics were calculated per customer
- K-Means Clustering (k=3) was applied on scaled RFM values
- The least engaged cluster (low frequency & monetary value) was labeled `is_high_risk = 1`
- All other clusters were labeled `is_high_risk = 0`
This binary proxy was added back to the dataset for supervised learning.
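The proxy-labeling steps above can be sketched as follows (a minimal example on toy RFM values; picking the high-risk cluster by lowest mean Frequency + Monetary is one reasonable reading of "least engaged", not necessarily the project's exact rule):

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Toy RFM table: Recency (days), Frequency (count), Monetary (total value)
rfm = pd.DataFrame({
    "CustomerId": ["C1", "C2", "C3", "C4", "C5", "C6"],
    "Recency":   [5, 8, 60, 75, 30, 28],
    "Frequency": [40, 35, 2, 1, 12, 15],
    "Monetary":  [5000, 4200, 80, 50, 900, 1100],
}).set_index("CustomerId")

# Scale the RFM values, then cluster into k=3 behavioral segments
scaled = StandardScaler().fit_transform(rfm)
rfm["cluster"] = KMeans(n_clusters=3, random_state=42, n_init=10).fit_predict(scaled)

# The least engaged cluster (lowest mean Frequency + Monetary) becomes the proxy
engagement = rfm.groupby("cluster")[["Frequency", "Monetary"]].mean().sum(axis=1)
high_risk_cluster = engagement.idxmin()
rfm["is_high_risk"] = (rfm["cluster"] == high_risk_cluster).astype(int)
print(rfm["is_high_risk"])
```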
- **Models Trained:**
  - Logistic Regression (with WoE)
  - Decision Tree
  - Random Forest
  - Gradient Boosting
- **Experiment Tracking:**
  - All runs tracked with `mlflow`
  - Best model registered in the MLflow Model Registry
- **Training Pipeline:**
  - `sklearn.pipeline.Pipeline` used for reproducibility
  - `GridSearchCV` for hyperparameter tuning
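The `Pipeline` + `GridSearchCV` combination looks roughly like this (a self-contained sketch on synthetic data standing in for the processed features; the parameter grid is illustrative, and MLflow logging is omitted here for brevity):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the processed feature matrix and proxy target
X, y = make_classification(n_samples=400, n_features=6, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Scaling and model fitted together, so preprocessing is reproducible at inference
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
])
grid = GridSearchCV(
    pipe,
    param_grid={"clf__C": [0.01, 0.1, 1.0, 10.0]},  # regularization strength
    cv=5,
    scoring="f1",
)
grid.fit(X_train, y_train)
print(grid.best_params_, round(grid.score(X_test, y_test), 3))
```

Because the scaler lives inside the pipeline, cross-validation refits it on each training fold, avoiding leakage from the held-out fold.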
| Metric | Meaning |
|---|---|
| Accuracy | Overall correct predictions |
| Precision | Correct positive predictions |
| Recall | Ability to detect all actual positives |
| F1-Score | Harmonic average of Precision and Recall |
| ROC AUC | Class separation capacity of model |
The final model was chosen based on best F1-score and ROC AUC.
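All five metrics are available in `sklearn.metrics`; a small worked example on toy labels (the numbers are illustrative, not project results):

```python
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score)

y_true = [0, 0, 1, 1, 0, 1, 0, 1]                     # toy ground-truth labels
y_prob = [0.1, 0.6, 0.8, 0.65, 0.3, 0.9, 0.2, 0.55]   # predicted risk probabilities
y_pred = [int(p >= 0.5) for p in y_prob]              # 0.5 decision threshold

accuracy = accuracy_score(y_true, y_pred)    # 7 of 8 correct -> 0.875
precision = precision_score(y_true, y_pred)  # one false positive -> 0.8
recall = recall_score(y_true, y_pred)        # every actual positive found -> 1.0
f1 = f1_score(y_true, y_pred)                # harmonic mean of precision and recall
auc = roc_auc_score(y_true, y_prob)          # threshold-free ranking quality
print(accuracy, precision, recall, round(f1, 3), auc)
```

Note that ROC AUC is computed from the probabilities, not the thresholded predictions, which is why it is useful for comparing models before a decision threshold is fixed.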
The model is deployed using FastAPI with Docker support.
Example request body:

```json
{
  "Recency": 14,
  "Frequency": 5,
  "Monetary": 1200,
  "Transaction_Hour": 16
}
```

Example response:

```json
{
  "risk_probability": 0.73
}
```

Deployed using:
- FastAPI
- Uvicorn
- Docker
- MLflow model loading
- ✅ Unit tests with `pytest` (`tests/` folder)
- ✅ Linting with `flake8`
- ✅ GitHub Actions for Continuous Integration
```yaml
# .github/workflows/ci.yml
jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - name: Install dependencies
        run: pip install -r requirements.txt
      - name: Run Linter
        run: flake8 .
      - name: Run Tests
        run: pytest
```

To install locally:

```bash
git clone <repository-url>
cd credit-risk-model
pip install -r requirements.txt
```

To run the API:

```bash
uvicorn src.api.main:app --reload
```

Visit http://localhost:8000/docs for the Swagger UI.
Contributions are welcome!
- Fork the repo
- Create a new branch: `git checkout -b feature-branch`
- Make changes and commit: `git commit -m 'Add feature'`
- Push: `git push origin feature-branch`
- Open a Pull Request
This project is licensed under the Apache License 2.0.
See the LICENSE file for more details.
- 10 Academy for the challenge and guidance
- Xente Data (Kaggle) for providing the dataset
- Basel II Accord and HKMA for regulatory guidelines
- Shichen.name Scorecard for WoE & credit scoring tools