A serverless MapReduce data pipeline built with Lambda, Step Functions, and S3. Supports large JSONL processing with streaming I/O, hash-based partitioning, and scalable fan-out execution, packaged with modular SAM templates and full test coverage.

📊 MapReduce Lambda Pipeline for URL Counting

A lightweight, fully serverless MapReduce-style data processing pipeline built on AWS Lambda, Step Functions (Express Workflows), and Amazon S3.
It processes large volumes of JSONL input files containing URLs and produces aggregated counts per unique URL, with no EMR, no EC2, and no servers to manage.


🚀 Features

This project demonstrates how to process large datasets without running out of memory, using:

- Zero manually managed compute
- Horizontal scaling via the Step Functions Map state
- Fan-out Mapper -> centralized Grouping -> fan-out Reducer
- Hash-based bucketing (map phase)
- Bucket-level aggregation (reduce phase)
- Streaming reads from S3
- Memory-safe chunk processing
- Temporary + final S3 storage layers
- AWS SAM (both monolithic and modular nested stacks)
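
The streaming and memory-safe chunk processing points above can be sketched as follows. This is a rough, hypothetical stand-in for the project's `utils/s3_io.py` helpers: a file-like byte stream plays the role of the S3 object body that boto3's `get_object` would return, and `iter_jsonl` is an invented name, not the module's actual API.

```python
import io
import json

def iter_jsonl(stream, chunk_size=1024):
    """Yield parsed JSON objects line by line without loading the whole file.

    `stream` is any file-like object of bytes; in the real pipeline this
    would be the streaming body of an S3 object.
    """
    buffer = b""
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        buffer += chunk
        # Emit every complete line currently in the buffer
        while b"\n" in buffer:
            line, buffer = buffer.split(b"\n", 1)
            if line.strip():
                yield json.loads(line)
    if buffer.strip():  # final record without a trailing newline
        yield json.loads(buffer)

# Example: two JSONL records streamed in deliberately tiny chunks
data = b'{"url": "https://a.example"}\n{"url": "https://b.example"}\n'
records = list(iter_jsonl(io.BytesIO(data), chunk_size=8))
```

Because only one chunk and at most one partial line are held in memory at a time, peak memory stays flat regardless of the input file's size.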

πŸ— Tech Stack

| Component | Description |
|---|---|
| Python 3.13 | Core language |
| Amazon S3 | Storage layer |
| AWS Lambda | Compute layer |
| Step Functions | Workflow orchestration |
| AWS SAM | Infrastructure deployment |
| boto3 | AWS SDK |
| pytest | Testing framework |
| uv | Lightweight dependency tool |

πŸ“ Project Structure

```
src/mapreduce_lambda_aws/
├── mapper/
│   └── handler.py              # Mapper Lambda
├── grouping/
│   └── handler.py              # GroupOutputs Lambda (organizes reducer jobs)
├── reducer/
│   └── handler.py              # Reducer Lambda
├── utils/
│   ├── s3_io.py                # Streaming reads/writes
│   └── s3_batch_writer.py
└── config/
    └── settings.py             # NUM_BUCKETS, BUCKET_TEMP, BUCKET_OUTPUT, etc.

template.yaml                   # Monolithic SAM stack for initial testing

sam/                            # Production-ready and fully modular SAM stack
├── template.yaml               # Root modular SAM template
├── lambdas/                    # Individual Lambda stacks
└── statemachine/               # State machine with Mapper, Group, and Reducer Lambdas

aws_tests/
└── manual/                     # S3 content creation

tests/unit/
├── test_group_mapper_outputs.py
├── test_mapper_hashing.py
├── test_mapper_helpers.py
└── test_reducer_helpers.py

tests/integration/              # End-to-end tests for mapper and reducer
```

🧱 Architecture Overview

```
┌───────────────────────────┐
│   Raw Input S3 Bucket     │
│  (JSONL files: urls etc.) │
└────────────┬──────────────┘
             │
             ▼
   ┌────────────────────┐
   │  Step Functions    │
   │  MAP State         │
   │  (parallel invoke) │
   └─────────┬──────────┘
             │ each input file
             ▼
 ┌──────────────────────┐
 │   Mapper Lambda      │
 │  - Reads JSONL       │
 │  - Buckets URLs      │
 │  - Writes partials   │
 └───────────┬──────────┘
             │
             ▼
 ┌──────────────────────┐
 │ Temporary S3 Bucket  │
 │  (partial counts)    │
 └───────────┬──────────┘
             │
             ▼
┌──────────────────────────────────────┐
│   Grouping Lambda                    │
│  - group mapper outputs by bucket_id │
└───────────────┬──────────────────────┘
                │
                ▼
      ┌────────────────────┐
      │  Step Functions    │
      │  MAP State         │
      │  (parallel invoke) │
      └─────────┬──────────┘
                │ each set of temp files for one bucket
                ▼
     ┌──────────────────┐
     │ Reducer Lambda   │
     │ - Reads partials │
     │ - Merges counts  │
     │ - Writes final   │
     └──────────┬───────┘
                │
                ▼
     ┌─────────────────────┐
     │ Final Output Bucket │
     │ bucket_X.json       │
     └─────────────────────┘
```

📚 Summary of Flow

1. **Mapper Lambda**: reads raw JSONL files in a streaming manner, hashes keys into buckets, and writes partial counts to S3.
2. **Grouping Lambda**: collects mapper outputs and groups them by bucket ID.
3. **Reducer Lambda**: processes each bucket independently and merges partial results.
4. **Final output**: per-bucket aggregates are written to the output bucket.
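
The map and reduce steps above come down to two small pieces of pure logic: a deterministic hash that assigns each URL to a bucket, and a counter merge per bucket. A minimal sketch, assuming an MD5-based hash and a `NUM_BUCKETS` constant in the spirit of `config/settings.py` (the project's actual hash function and bucket count may differ):

```python
import hashlib
from collections import Counter

NUM_BUCKETS = 32  # illustrative; the real value lives in config/settings.py

def bucket_for(url: str) -> int:
    """Deterministically assign a URL to a bucket (map phase)."""
    digest = hashlib.md5(url.encode("utf-8")).hexdigest()
    return int(digest, 16) % NUM_BUCKETS

def reduce_partials(partials):
    """Merge partial per-URL counts for one bucket (reduce phase)."""
    total = Counter()
    for counts in partials:
        total.update(counts)
    return dict(total)

# Two mappers produced partial counts for the same bucket; the reducer merges them
merged = reduce_partials([
    {"https://a.example": 2, "https://b.example": 1},
    {"https://a.example": 3},
])
```

Because the hash is deterministic, every occurrence of a given URL lands in the same bucket no matter which mapper sees it, so each reducer can aggregate its bucket independently.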

📌 Quick Start

Install dependencies:

```bash
uv sync
```

Run tests:

```bash
uv run pytest
```

🚀 Deployment Notes

The Lambda functions are intentionally:

- simple
- synchronous
- stateless

This keeps the workflow predictable and aligned with AWS best practices for high-throughput serverless MapReduce pipelines.

🧱 SAM Deployment Architectures

1. Monolithic SAM Template

- single template.yaml
- ideal for prototyping

2. Modular Production Architecture

- nested stacks
- decoupled compute + workflow layers
- clean CI/CD

This is the recommended deployment model.

βš™οΈ Deploying the Modular Version (recommended)

1. Build

```bash
sam build
```

2. Deploy

```bash
sam deploy --guided
```

3. Run execution (sync)

```bash
aws stepfunctions start-sync-execution \
  --state-machine-arn <ARN> \
  --input file://input.json \
  --region ap-south-1
```

🧪 Sample Input

```json
{
  "input_files": [
    {
      "bucket": "mapreduce-lambda-raw-mumbai",
      "key": "events/2025/12/01/file00.jsonl"
    },
    {
      "bucket": "mapreduce-lambda-raw-mumbai",
      "key": "events/2025/12/01/file08.jsonl"
    }
  ]
}
```
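
An input payload like the one above can also be assembled programmatically. The sketch below is a hypothetical helper (only the bucket name and key layout are taken from the sample), not part of the project:

```python
import json

def build_input(bucket: str, keys):
    """Assemble the Step Functions input event from a list of S3 keys."""
    return {"input_files": [{"bucket": bucket, "key": k} for k in keys]}

event = build_input(
    "mapreduce-lambda-raw-mumbai",
    ["events/2025/12/01/file00.jsonl", "events/2025/12/01/file08.jsonl"],
)
payload = json.dumps(event, indent=2)  # write this to input.json for the CLI
```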

Output

```json
[
  {
    "output_bucket": "mapreduce-lambda-output-mumbai",
    "output_key": "reduce-output/bucket_6.json"
  },
  {
    "output_bucket": "mapreduce-lambda-output-mumbai",
    "output_key": "reduce-output/bucket_12.json"
  },
  {
    "output_bucket": "mapreduce-lambda-output-mumbai",
    "output_key": "reduce-output/bucket_21.json"
  },
  {
    "output_bucket": "mapreduce-lambda-output-mumbai",
    "output_key": "reduce-output/bucket_23.json"
  },
  {
    "output_bucket": "mapreduce-lambda-output-mumbai",
    "output_key": "reduce-output/bucket_28.json"
  }
]
```

🧪 Testing

Includes:

- Unit tests (pure logic)
- Integration tests (mapper + reducer)
- Manual AWS tests for S3 batch writer

Mocking ensures no AWS calls during local tests.
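
A sketch of what such mocking can look like using the standard library's `unittest.mock`; the `count_lines` helper and its call pattern are hypothetical illustrations, not the project's actual test code:

```python
from unittest.mock import MagicMock

def count_lines(s3_client, bucket: str, key: str) -> int:
    """Toy stand-in for a handler helper that reads one object from S3."""
    body = s3_client.get_object(Bucket=bucket, Key=key)["Body"]
    return sum(1 for line in body.read().splitlines() if line.strip())

# Replace the real boto3 client with a mock so no AWS call is ever made
mock_s3 = MagicMock()
mock_s3.get_object.return_value = {
    "Body": MagicMock(read=lambda: b'{"url": "a"}\n{"url": "b"}\n')
}

n = count_lines(mock_s3, "raw-bucket", "events/file00.jsonl")
mock_s3.get_object.assert_called_once_with(Bucket="raw-bucket", Key="events/file00.jsonl")
```

Injecting the client as a parameter (rather than constructing it inside the function) is what makes this substitution trivial in tests.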

✅ Tests in Action

Run all tests:

```bash
uv run pytest -v tests
```

Test Output:

```
========================= test session starts =========================
collected 14 items

tests/integration/test_mapper_integration.py::test_mapper_lambda_integration PASSED [  7%]
tests/integration/test_reducer_integration.py::test_reducer_lambda_integration PASSED [ 14%]
tests/unit/test_group_mapper_outputs.py::test_extract_bucket_id_valid PASSED        [ 21%]
tests/unit/test_group_mapper_outputs.py::test_extract_bucket_id_invalid PASSED      [ 28%]
tests/unit/test_group_mapper_outputs.py::test_get_bucket_keys_map_basic PASSED      [ 35%]
tests/unit/test_group_mapper_outputs.py::test_get_nonempty_buckets_and_keys PASSED  [ 42%]
tests/unit/test_group_mapper_outputs.py::test_lambda_handler_full_flow PASSED       [ 50%]
tests/unit/test_group_mapper_outputs.py::test_lambda_handler_empty_results PASSED   [ 57%]
tests/unit/test_group_mapper_outputs.py::test_lambda_handler_missing_results_key PASSED [ 64%]
tests/unit/test_group_mapper_outputs.py::test_lambda_handler_inner_empty_outputs PASSED [ 71%]
tests/unit/test_mapper_hashing.py::test_simple_bucket_hash PASSED                   [ 78%]
tests/unit/test_mapper_helpers.py::test_populate_partition_counts_basic PASSED      [ 85%]
tests/unit/test_mapper_helpers.py::test_write_partition_counts_to_s3_basic PASSED   [ 92%]
tests/unit/test_reducer_helpers.py::test_reduce_data_for_bucket_basic PASSED        [100%]

========================= 14 passed in 7.27s =========================
```

💡 Future Enhancements

- Support more scalable JSONL processing using S3 Select (server-side filtering)
- Add CloudWatch structured logging + X-Ray tracing
- Use S3 Inventory for scalable object discovery at bucket scale

🏷️ License

MIT License: free to use, modify, and share.
