End-to-end credit card fraud detection system with ML training pipeline, model evaluation, REST API, and Docker deployment.
Credit card fraud costs the industry billions annually. Detecting fraudulent transactions in real time is critical, but the extreme class imbalance (fraud is only ~0.17% of transactions) makes this a challenging classification problem. FraudShield tackles this by:
- Engineering robust features from raw transaction data
- Comparing multiple classifiers with multiple imbalance-handling strategies
- Optimizing the decision threshold for the best F1 score
- Serving predictions via a low-latency REST API
| Stat | Value |
|---|---|
| Source | Kaggle Credit Card Fraud Detection |
| Transactions | 284,807 |
| Fraud rate | 0.172% (492 fraudulent) |
| Features | V1-V28 (PCA), Time, Amount |
| Target | Class (0 = legit, 1 = fraud) |
ROC-AUC can be misleadingly high on imbalanced datasets because the abundance of true negatives keeps the false positive rate tiny across the curve. With 99.83% legitimate transactions, a model that predicts "legit" for everything achieves ~99.8% accuracy while catching zero fraud.
PR-AUC (Precision-Recall Area Under Curve) focuses exclusively on the positive (fraud) class, making it a far more informative metric when the positive class is rare. A random classifier would score ~0.0017 PR-AUC (the prevalence rate), so any meaningful PR-AUC value represents real detection capability.
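A quick sanity check (illustrative, not part of the repo) shows why: on data with this project's imbalance, a completely uninformative classifier still lands near 0.5 ROC-AUC, while its PR-AUC collapses to roughly the prevalence rate.

```python
import numpy as np
from sklearn.metrics import average_precision_score, roc_auc_score

rng = np.random.default_rng(42)

# Simulate the dataset's imbalance: ~0.17% positives
y_true = (rng.random(100_000) < 0.0017).astype(int)
# A "random" classifier: scores carry no information about the label
y_score = rng.random(100_000)

print(f"ROC-AUC: {roc_auc_score(y_true, y_score):.3f}")              # close to 0.5
print(f"PR-AUC:  {average_precision_score(y_true, y_score):.4f}")    # close to the prevalence rate
```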
- `Amount_log`: `log1p(Amount)` compresses the heavy right tail of transaction amounts
- `hour_sin` / `hour_cos`: cyclical encoding of the transaction hour preserves the circular nature of time (hour 23 is close to hour 0)
- Raw `Time` and `Amount` columns are dropped after transformation
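The transforms above can be sketched as follows (a minimal version, assuming the Kaggle column names and that `Time` is seconds since the first transaction; the function name is illustrative, not the repo's):

```python
import numpy as np
import pandas as pd

def engineer_features(df: pd.DataFrame) -> pd.DataFrame:
    """Sketch of the feature engineering described above."""
    out = df.copy()
    out["Amount_log"] = np.log1p(out["Amount"])        # compress the heavy right tail
    hour = (out["Time"] // 3600) % 24                  # Time = seconds since first transaction
    out["hour_sin"] = np.sin(2 * np.pi * hour / 24)    # cyclical encoding keeps
    out["hour_cos"] = np.cos(2 * np.pi * hour / 24)    # hour 23 adjacent to hour 0
    return out.drop(columns=["Time", "Amount"])        # raw columns dropped
```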
| Strategy | How it works |
|---|---|
| `class_weight` / `scale_pos_weight` | Penalizes misclassifying the minority class more heavily during training |
| SMOTE | Generates synthetic fraud samples in feature space (applied to training set only) |
| Random Undersampling | Downsamples the majority class to balance the training set |
Each of the 4 models is trained with each of the 3 strategies (12 combinations total) and evaluated on a held-out 15% test set:
| Model | Strategy | ROC-AUC | PR-AUC | F1 | Precision | Recall |
|---|---|---|---|---|---|---|
| LightGBM | class_weight | ~0.98 | ~0.80 | ~0.85 | ~0.88 | ~0.82 |
| XGBoost | class_weight | ~0.98 | ~0.79 | ~0.84 | ~0.87 | ~0.81 |
| RandomForest | class_weight | ~0.97 | ~0.76 | ~0.82 | ~0.86 | ~0.79 |
| LogReg | class_weight | ~0.97 | ~0.70 | ~0.73 | ~0.75 | ~0.71 |
Note: exact numbers depend on the random split. Run `python -m src.train` to see your results.
The winning model (typically LightGBM or XGBoost) is then hyperparameter-tuned via RandomizedSearchCV (20 iterations, 5-fold stratified CV, scoring by average_precision).
Instead of using the default 0.5 threshold, we sweep the Precision-Recall curve to find the threshold that maximizes F1 on the test set. This typically lands around 0.4-0.6 depending on the model.
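The sweep amounts to walking the PR curve and picking the point with the best F1 (a minimal sketch; the function name is illustrative). Note that `precision_recall_curve` returns one more precision/recall point than thresholds, so the last point is dropped:

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

def best_f1_threshold(y_true, y_prob):
    """Return the decision threshold that maximizes F1 on (y_true, y_prob)."""
    precision, recall, thresholds = precision_recall_curve(y_true, y_prob)
    # precision/recall have one more entry than thresholds; drop the final point
    p, r = precision[:-1], recall[:-1]
    f1 = 2 * p * r / np.clip(p + r, 1e-12, None)  # guard against 0/0
    i = np.argmax(f1)
    return thresholds[i], f1[i]
```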
```
FraudShield/
├── src/
│   ├── preprocess.py      # Feature engineering
│   ├── train.py           # Model comparison (4 models x 3 strategies)
│   ├── evaluate.py        # Threshold optimization + plots
│   └── train_final.py     # Hyperparameter tuning + artifact export
├── api/
│   ├── schema.py          # Pydantic v2 request/response models
│   └── main.py            # FastAPI inference server
├── models/                # Saved model artifacts (.joblib)
├── plots/                 # Evaluation plots (PR curve, ROC, CM, SHAP)
├── tests/
│   └── test_api.py        # API tests (pytest + TestClient)
├── data/                  # Place creditcard.csv here
├── Dockerfile
├── requirements.txt
└── README.md
```
```bash
python -m venv .venv
source .venv/bin/activate   # Linux/macOS
# .venv\Scripts\activate    # Windows
pip install -r requirements.txt
```

Download `creditcard.csv` from Kaggle and place it in the `data/` directory.
```bash
# Run the full comparison pipeline (12 combinations)
python -m src.train --data data/creditcard.csv

# Evaluate the best model + generate plots
python -m src.evaluate --data data/creditcard.csv

# Tune the final model and save the artifact
python -m src.train_final --data data/creditcard.csv --output models/fraud_model.joblib
```

Start the API server:

```bash
uvicorn api.main:app --host 0.0.0.0 --port 8000 --reload
```

Run the tests:

```bash
pytest tests/ -v
```

Check the health endpoint:

```bash
curl http://localhost:8000/health
```

Response:
```json
{
  "status": "healthy",
  "model_version": "1.0.0"
}
```

Request a prediction:

```bash
curl -X POST http://localhost:8000/predict \
  -H "Content-Type: application/json" \
  -d '{
    "amount_log": 4.02,
    "hour_sin": 0.5,
    "hour_cos": 0.866,
    "v_features": [
      -1.36, -0.07, 2.54, 1.38, -0.34, 0.46, 0.24, 0.10,
      0.36, 0.09, -0.55, -0.62, -0.99, -0.31, 1.47, -0.47,
      0.21, 0.03, 0.40, 0.25, -0.02, 0.28, -0.11, 0.07,
      0.13, -0.19, 0.13, -0.02
    ]
  }'
```

Response:

```json
{
  "fraud_probability": 0.032451,
  "is_fraud": false,
  "risk_level": "LOW"
}
```

| Level | Condition |
|---|---|
| LOW | probability < threshold |
| MEDIUM | threshold <= probability <= 0.8 |
| HIGH | probability > 0.8 |
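The table above maps onto a simple two-cutoff function (an illustrative sketch, not the repo's exact code; `threshold` is the tuned decision threshold shipped with the model artifact):

```python
def risk_level(probability: float, threshold: float) -> str:
    """Map a fraud probability to LOW / MEDIUM / HIGH per the table above."""
    if probability > 0.8:
        return "HIGH"
    if probability >= threshold:
        return "MEDIUM"
    return "LOW"
```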
```bash
docker build -t fraudshield .
docker run -p 8000:8000 fraudshield
```

One-liner (build + run):

```bash
docker build -t fraudshield . && docker run -p 8000:8000 fraudshield
```

To mount locally trained model artifacts:

```bash
docker run -p 8000:8000 -v $(pwd)/models:/app/models fraudshield
```

After running `python -m src.evaluate`, the `plots/` directory will contain:
| Plot | Description |
|---|---|
| `precision_recall_curve.png` | PR curve with optimal F1 threshold marked |
| `roc_curve.png` | ROC curve with AUC |
| `confusion_matrix.png` | Heatmap at optimal threshold |
| `shap_summary.png` | SHAP feature importance for the fraud class |
MIT