
Real-time Fraud Detection on AWS

Detecting potential fraud in financial systems is a major challenge for organizations worldwide. Building robust solutions that enable real-time actions is essential for companies aiming to provide greater security to their customers during financial transactions.

This repository demonstrates a complete machine learning pipeline for credit card fraud detection using the Kaggle Credit Card Fraud Detection dataset, which contains 284,807 European cardholder transactions from 2013 (including 492 fraudulent cases) with 28 PCA-transformed features plus original Amount and Time variables.
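
As a quick orientation before touching any AWS resources, the class imbalance can be checked locally. This is a minimal sketch, assuming the Kaggle CSV has been downloaded to dataset/creditcard.csv as in the project layout further down.

import pandas as pd

# Load the Kaggle dataset and inspect the fraud/legitimate split.
df = pd.read_csv("dataset/creditcard.csv")
counts = df["Class"].value_counts()  # Class: 1 = fraud, 0 = legitimate
print(counts)
print(f"Fraud ratio: {counts[1] / len(df):.4%}")  # roughly 0.17% of transactions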

The project showcases an end-to-end streaming architecture on AWS, from model training to real-time inference and monitoring. The complete solution includes:

  • Training of supervised and unsupervised ML models and deployment to managed endpoints with Amazon SageMaker
  • REST API deployment via Chalice (Lambda + API Gateway)
  • Streaming data pipeline (Kinesis → Spark/Glue → RDS)
  • (Optional) Interactive dashboard for real-time fraud monitoring and analysis.

Architecture overview:

(Architecture diagram available in assets/.)


Project structure

aws-realtime-fraud-detection/
├── app/                 
│   ├── chalice/                  # Serverless API (Chalice)
│   └── streamlit/                # Dashboard (Streamlit)
├── assets/                       # Images, diagrams
├── devops/infra/                 # Infrastructure-as-Code (Terraform, etc.)
├── docs/                         # Documentation
├── scripts/                      # Data generation (client simulator)
├── src/fraudit/                  # Streaming pipeline & utilities
│   ├── jobs/elt/                 # Schema, transformations, loading
│   └── utils/                    # PostgreSQL DDL, logging, etc.
├── dataset/                      # Local datasets (e.g., creditcard.csv)
├── docker-compose.yml            # Launching the dashboard (optional)
└── pyproject.toml                # Package configuration (single source of truth)

Prerequisites

The commands below assume you have an AWS account with sufficient permissions, plus the AWS CLI, Terraform, make, uv, and Python 3 installed locally (Docker is only needed for the optional dashboard).

Quick start

  • Set up your virtual environment and install the required packages
$ uv sync
  • Set up your AWS credentials
$ aws configure
  • Provision AWS resources
$ make tf.init
$ make tf.plan
$ make tf.apply

Set up the Chalice configuration file app/chalice/.chalice/config.json using the Lambda execution role ARN from the Terraform output. You can then provision Lambda and API Gateway and deploy the API app on Lambda.

$ make chalice.deploy
  • Build and deploy the fraudit package wheel and Glue job artifacts to S3 for consumption by the AWS Glue job
$ make deploy.glue

AWS SageMaker

SageMaker is used to train and deploy the ML models. The training and deployment notebooks are located in the sagemaker/ folder.
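
Once the endpoints are up, they can be invoked directly from Python for a quick sanity check. This is a minimal sketch using boto3; the endpoint name and the text/csv content type are assumptions, so check the SageMaker notebooks for the values this project actually uses.

import boto3

# Invoke a deployed SageMaker endpoint with one transaction's features.
runtime = boto3.client("sagemaker-runtime", region_name="eu-west-1")

features_csv = "0.12, 50.3, 1, 0, 3"  # comma-separated feature values for one transaction

response = runtime.invoke_endpoint(
    EndpointName="fraud-detection-xgb",  # hypothetical endpoint name
    ContentType="text/csv",
    Body=features_csv,
)
print(response["Body"].read().decode("utf-8"))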

Inference API (Chalice)

  • Route: POST /predict

Setup

  1. Set up the Chalice configuration file: app/chalice/.chalice/config.json
{
    "version": "2.0",
    "app_name": "ml-inference-api",
    "stages": {
        "dev": {
            "api_gateway_stage": "api",
            "manage_iam_role": false,
            "iam_role_arn": "<terraform_lambda_exec_role_arn_output>",
            "environment_variables": {
                "solution_prefix": "fraud-detection",
                "stream_name": "fraud-predictions-stream",
                "aws_region": "eu-west-1"
            }
        }
    }
}
  2. (Optional) Test the Chalice deployment locally
$ chalice local --port 8000   # optional, serves the API at http://localhost:8000/
  3. Deploy the Chalice app to AWS Lambda (a sketch of the handler follows below)
$ cd app/chalice
$ chalice deploy
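
For orientation, here is a minimal sketch of what the handler behind POST /predict could look like. The endpoint name, response parsing, and the Kinesis record shape are assumptions; the actual implementation lives in app/chalice/.

import json
import os

import boto3
from chalice import Chalice

app = Chalice(app_name="ml-inference-api")
runtime = boto3.client("sagemaker-runtime", region_name=os.environ.get("aws_region", "eu-west-1"))
kinesis = boto3.client("kinesis", region_name=os.environ.get("aws_region", "eu-west-1"))


@app.route("/predict", methods=["POST"])
def predict():
    body = app.current_request.json_body
    features = body["data"]  # CSV string of features, as in the example below

    resp = runtime.invoke_endpoint(
        EndpointName="fraud-detection-xgb",  # hypothetical endpoint name
        ContentType="text/csv",
        Body=features,
    )
    proba = float(resp["Body"].read())  # assumes the endpoint returns a single probability

    result = {"fraud_classifier": {"pred_proba": proba, "prediction": int(proba > 0.5)}}

    # Forward the scored transaction to Kinesis for the streaming pipeline.
    kinesis.put_record(
        StreamName=os.environ["stream_name"],
        Data=json.dumps({"metadata": body.get("metadata"), **result}),
        PartitionKey=body.get("metadata", {}).get("user_id", "unknown"),
    )
    return result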

Minimal transaction example

  • JSON input (minimal example):
{
  "metadata": {
    "timestamp": "2025-08-21T17:45:00Z",
    "user_id": "u_123",
    "source": "checkout",
    "device_info": {"device_type": "mobile", "os_version": "iOS 17", "app_version": "2.4.1"},
    "ip_address": "203.0.113.10",
    "geo": {"country": "fr", "region": "IDF", "city": "Paris", "latitude": 48.85, "longitude": 2.35}
  },
  "data": "0.12, 50.3, 1, 0, 3, ..."
}
  • API Output (excerpt):
{
  "anomaly_detector": {"score": 0.02},
  "fraud_classifier": {"pred_proba": 0.13, "prediction": 0}
}
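
The deployed route can be exercised with a short Python client, for example (assuming CHALICE_API_URL is the base URL of the deployed API, e.g. the URL printed by chalice deploy):

import os

import requests

payload = {
    "metadata": {"timestamp": "2025-08-21T17:45:00Z", "user_id": "u_123", "source": "checkout"},
    "data": "0.12, 50.3, 1, 0, 3",
}

# Adjust if CHALICE_API_URL already includes the /predict route.
url = os.environ["CHALICE_API_URL"].rstrip("/") + "/predict"

resp = requests.post(url, json=payload, timeout=10)
resp.raise_for_status()
print(resp.json())  # e.g. {"anomaly_detector": {...}, "fraud_classifier": {...}}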

Environment variables

Create a .env file at the repo root (do not commit secrets). Tip: keep a .env.example without secrets in the repo and your real .env locally.
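
The variable names used across the rest of this README can be validated up front. A minimal sketch, assuming python-dotenv is available (the exact list of variables your setup needs may differ):

import os

from dotenv import load_dotenv

load_dotenv()  # reads the .env file at the repo root

REQUIRED = [
    "CHALICE_API_URL",         # inference API base URL
    "KINESIS_STREAM",          # stream read by the Spark job
    "KINESIS_CONNECTOR_PATH",  # path to the Spark Kinesis connector JAR (local runs)
    "POSTGRES_HOST", "POSTGRES_DB", "POSTGRES_USER", "POSTGRES_PASSWORD", "POSTGRES_PORT",
]

missing = [name for name in REQUIRED if not os.getenv(name)]
if missing:
    raise RuntimeError(f"Missing required environment variables: {missing}")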

Spark Streaming job

Glue Job Deployment

  1. Install the build package
$ python3 -m pip install build
  2. Package the project (wheel)
$ python3 -m build

This will result in a wheel file fraudit-0.0.1-py3-none-any.whl in the dist/ directory.

  3. Upload the job script, the wheel, and the Spark Kinesis connector to their respective S3 locations for Glue

    Tip: see the --additional-python-modules and --extra-jars arguments in devops/infra/main/glue.tf for more details.

$ aws s3 cp dist/fraudit-0.0.1-py3-none-any.whl s3://credit-card-fraud-detection-spark-streaming-bucket/wheel/fraudit-0.0.1-py3-none-any.whl
$ aws s3 cp src/fraudit/glue_job.py s3://credit-card-fraud-detection-spark-streaming-bucket/spark-jobs/
$ aws s3 cp src/resources/spark-streaming-sql-kinesis-connector_2.12-1.0.0.jar s3://credit-card-fraud-detection-spark-streaming-bucket/jars/spark-streaming-sql-kinesis-connector_2.12-1.0.0.jar
  4. Once the artifacts are uploaded, you can start the Glue job from the console, ensuring the default arguments defined in glue.tf are set.

Local Job Running

  1. Download and set up Apache Spark locally; refer to the Spark installation guide.
  2. Make sure environment variables are set in .env.
  3. Run the job
$ python -m fraudit.main

The job reads the Kinesis stream (KINESIS_STREAM), transforms the data (src/fraudit/jobs/elt/transform.py), and appends the results to the fraud_predictions table.
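
As a rough sketch of the shape of this job (the real schema, transformations, and loading logic live in src/fraudit/jobs/elt/): the Kinesis source format and option names below depend on the connector JAR and are assumptions, while the foreachBatch JDBC sink is standard Spark.

import os

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("fraudit-streaming")
    .config("spark.jars", os.environ["KINESIS_CONNECTOR_PATH"])  # local runs need the connector JAR
    .getOrCreate()
)

raw = (
    spark.readStream
    .format("aws-kinesis")  # assumption: format/option names vary by connector version
    .option("kinesis.streamName", os.environ["KINESIS_STREAM"])
    .option("kinesis.region", "eu-west-1")
    .option("kinesis.startingPosition", "LATEST")
    .load()
)

# The real job parses the payload with from_json(...) using the schema in jobs/elt/
# and applies transform.py; here the raw record is only cast to a string.
records = raw.selectExpr("CAST(data AS STRING) AS json")


def write_batch(batch_df, _batch_id):
    # Append each micro-batch to PostgreSQL (requires the JDBC driver on the classpath).
    (
        batch_df.write.format("jdbc")
        .option("url", f"jdbc:postgresql://{os.environ['POSTGRES_HOST']}:{os.environ['POSTGRES_PORT']}/{os.environ['POSTGRES_DB']}")
        .option("dbtable", "fraud_predictions")
        .option("user", os.environ["POSTGRES_USER"])
        .option("password", os.environ["POSTGRES_PASSWORD"])
        .option("driver", "org.postgresql.Driver")
        .mode("append")
        .save()
    )


records.writeStream.foreachBatch(write_batch).start().awaitTermination()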

Simulated data generation

Prerequisites: .env with CHALICE_API_URL and dataset/creditcard.csv present.

$ python -m pip install -e .[scripts]
$ python scripts/generate_data.py
  • PARALLEL_INVOCATION in scripts/generate_data.py controls how many requests are sent in parallel.
  • Adjust max_requests according to the desired throughput (a simplified sketch of the generator follows).
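
In essence, the generator replays rows from the CSV against the inference API. The sketch below is a simplified, assumed version of scripts/generate_data.py; the payload shape and column handling may differ from the real script.

import os
from concurrent.futures import ThreadPoolExecutor

import pandas as pd
import requests

PARALLEL_INVOCATION = 8  # number of concurrent requests
max_requests = 1000      # adjust to the desired throughput

df = pd.read_csv("dataset/creditcard.csv").head(max_requests)
url = os.environ["CHALICE_API_URL"].rstrip("/") + "/predict"  # adjust if the route is already included


def send(row):
    payload = {
        "metadata": {"source": "simulator"},
        "data": ", ".join(str(v) for v in row.drop("Class").values),  # drop the label before sending
    }
    return requests.post(url, json=payload, timeout=10).status_code


with ThreadPoolExecutor(max_workers=PARALLEL_INVOCATION) as pool:
    statuses = list(pool.map(send, (row for _, row in df.iterrows())))

print(f"Sent {len(statuses)} requests")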

Dashboard (Streamlit)

  • Via Docker:
$ docker compose up dashboard
  • Or locally:
$ cd app/streamlit
$ pip install -r requirements.txt
$ streamlit run app.py

Ensure POSTGRES_HOST/DB/USER/PASSWORD/PORT are configured.
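
For reference, a minimal sketch of a dashboard page reading recent predictions from PostgreSQL (table and column names are assumptions, and SQLAlchemy/psycopg2 are used here for convenience; the real app in app/streamlit/ may differ):

import os

import pandas as pd
import streamlit as st
from sqlalchemy import create_engine

# Build the connection from the POSTGRES_* variables mentioned above.
engine = create_engine(
    f"postgresql://{os.environ['POSTGRES_USER']}:{os.environ['POSTGRES_PASSWORD']}"
    f"@{os.environ['POSTGRES_HOST']}:{os.environ['POSTGRES_PORT']}/{os.environ['POSTGRES_DB']}"
)

st.title("Real-time Fraud Monitoring")

df = pd.read_sql("SELECT * FROM fraud_predictions LIMIT 500", engine)
st.metric("Transactions scored", len(df))
st.dataframe(df)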

Clean up

  • Destroy the infrastructure:
$ cd devops/infra/main && terraform destroy
  • Delete the Chalice API:
$ cd app/chalice && chalice delete

Troubleshooting

  • Error "Missing required environment variables" when starting locally: check your .env (see variables above).
  • Kinesis connector not found: set KINESIS_CONNECTOR_PATH to the JAR.
  • API 4xx/5xx during generation: check CHALICE_API_URL and quotas; reduce PARALLEL_INVOCATION.
  • Do not commit secrets in .env.

License

Educational/demo project. Adapt before production use.