A serverless MapReduce data pipeline built with Lambda, Step Functions, and S3. Supports large JSONL processing with streaming I/O, hash-based partitioning, and scalable fan-out execution, packaged with modular SAM templates and full test coverage.

📊 MapReduce Lambda Pipeline for URL Counting

A lightweight, fully serverless MapReduce-style data processing pipeline built on AWS Lambda, Step Functions (Express Workflows), and Amazon S3.
It processes large volumes of JSONL input files containing URLs and produces aggregated counts per unique URL, with no EMR, no EC2, and no servers to manage.


🚀 Features

This project demonstrates how to process large datasets without running out of memory, using:

- Zero manually managed compute
- Horizontal scaling via the Step Functions Map state
- Fan-out Mapper -> centralized Grouping -> fan-out Reducer
- Hash-based bucketing (map phase)
- Bucket-level aggregation (reduce phase)
- Streaming reads from S3
- Memory-safe chunk processing
- Temporary + final S3 storage layers
- AWS SAM (both monolithic and modular nested stacks)
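
The streaming and memory-safe chunk processing points above can be sketched as follows. This is a rough, hypothetical stand-in for the project's `utils/s3_io.py` helpers: a file-like byte stream plays the role of the S3 object body that boto3's `get_object` would return, and `iter_jsonl` is an invented name, not the module's actual API.

```python
import io
import json

def iter_jsonl(stream, chunk_size=1024):
    """Yield parsed JSON objects line by line without loading the whole file.

    `stream` is any file-like object of bytes; in the real pipeline this
    would be the streaming body of an S3 object.
    """
    buffer = b""
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        buffer += chunk
        # Emit every complete line currently in the buffer
        while b"\n" in buffer:
            line, buffer = buffer.split(b"\n", 1)
            if line.strip():
                yield json.loads(line)
    if buffer.strip():  # final record without a trailing newline
        yield json.loads(buffer)

# Example: two JSONL records streamed in deliberately tiny chunks
data = b'{"url": "https://a.example"}\n{"url": "https://b.example"}\n'
records = list(iter_jsonl(io.BytesIO(data), chunk_size=8))
```

Because only one chunk and at most one partial line are held in memory at a time, peak memory stays flat regardless of the input file's size.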

πŸ— Tech Stack

| Component | Description |
|---|---|
| Python 3.13 | Core language |
| Amazon S3 | Storage layer |
| AWS Lambda | Compute layer |
| Step Functions | Workflow orchestration |
| AWS SAM | Infrastructure deployment |
| boto3 | AWS SDK |
| pytest | Testing framework |
| uv | Lightweight dependency tool |

πŸ“ Project Structure

```
src/mapreduce_lambda_aws/
├── mapper/
│   └── handler.py              # Mapper Lambda
├── grouping/
│   └── handler.py              # GroupOutputs Lambda (organizes reducer jobs)
├── reducer/
│   └── handler.py              # Reducer Lambda
├── utils/
│   ├── s3_io.py                # Streaming reads/writes
│   └── s3_batch_writer.py
└── config/
    └── settings.py             # NUM_BUCKETS, BUCKET_TEMP, BUCKET_OUTPUT, etc.

template.yaml                   # Monolithic SAM stack for initial testing

sam/                            # Production-ready and fully modular SAM stack
├── template.yaml               # Root modular SAM template
├── lambdas/                    # Individual Lambda stacks
└── statemachine/               # State machine with Mapper, Group, and Reducer Lambdas

aws_tests/
└── manual/                     # S3 content creation

tests/unit/
├── test_group_mapper_outputs.py
├── test_mapper_hashing.py
├── test_mapper_helpers.py
└── test_reducer_helpers.py

tests/integration/              # End-to-end tests for mapper and reducer
```

🧱 Architecture Overview

```
┌───────────────────────────┐
│   Raw Input S3 Bucket     │
│  (JSONL files: urls etc.) │
└────────────┬──────────────┘
             │
             ▼
   ┌────────────────────┐
   │  Step Functions    │
   │  MAP State         │
   │  (parallel invoke) │
   └─────────┬──────────┘
             │ each input file
             ▼
 ┌──────────────────────┐
 │   Mapper Lambda      │
 │  - Reads JSONL       │
 │  - Buckets URLs      │
 │  - Writes partials   │
 └───────────┬──────────┘
             │
             ▼
 ┌──────────────────────┐
 │ Temporary S3 Bucket  │
 │  (partial counts)    │
 └───────────┬──────────┘
             │
             ▼
┌──────────────────────────────────────┐
│   Grouping Lambda                    │
│  - group mapper outputs by bucket_id │
└───────────────┬──────────────────────┘
                │
                ▼
      ┌────────────────────┐
      │  Step Functions    │
      │  MAP State         │
      │  (parallel invoke) │
      └─────────┬──────────┘
                │ each set of temp files for one bucket
                ▼
     ┌──────────────────┐
     │ Reducer Lambda   │
     │ - Reads partials │
     │ - Merges counts  │
     │ - Writes final   │
     └──────────┬───────┘
                │
                ▼
     ┌─────────────────────┐
     │ Final Output Bucket │
     │ bucket_X.json       │
     └─────────────────────┘
```

📚 Summary of Flow

1. **Mapper Lambda**: reads raw JSONL files in a streaming manner, hashes keys into buckets, and writes partial counts to S3.
2. **Grouping Lambda**: collects mapper outputs and groups them by bucket ID.
3. **Reducer Lambda**: processes each bucket independently and merges partial results.
4. **Final output**: per-bucket aggregates are written to the output bucket.
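
The map and reduce steps above come down to two small pieces of pure logic: a deterministic hash that assigns each URL to a bucket, and a counter merge per bucket. A minimal sketch, assuming an MD5-based hash and a `NUM_BUCKETS` constant in the spirit of `config/settings.py` (the project's actual hash function and bucket count may differ):

```python
import hashlib
from collections import Counter

NUM_BUCKETS = 32  # illustrative; the real value lives in config/settings.py

def bucket_for(url: str) -> int:
    """Deterministically assign a URL to a bucket (map phase)."""
    digest = hashlib.md5(url.encode("utf-8")).hexdigest()
    return int(digest, 16) % NUM_BUCKETS

def reduce_partials(partials):
    """Merge partial per-URL counts for one bucket (reduce phase)."""
    total = Counter()
    for counts in partials:
        total.update(counts)
    return dict(total)

# Two mappers produced partial counts for the same bucket; the reducer merges them
merged = reduce_partials([
    {"https://a.example": 2, "https://b.example": 1},
    {"https://a.example": 3},
])
```

Because the hash is deterministic, every occurrence of a given URL lands in the same bucket no matter which mapper sees it, so each reducer can aggregate its bucket independently.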

📌 Quick Start

Install dependencies:

```bash
uv sync
```

Run tests:

```bash
uv run pytest
```

🚀 Deployment Notes

The Lambda functions are intentionally:

- simple
- synchronous
- stateless

This keeps the workflow predictable and aligned with AWS best practices for high-throughput serverless MapReduce pipelines.

🧱 SAM Deployment Architectures

1. Monolithic SAM Template

- single template.yaml
- ideal for prototyping

2. Modular Production Architecture

- nested stacks
- decoupled compute + workflow layers
- clean CI/CD

This is the recommended deployment model.

βš™οΈ Deploying the Modular Version (recommended)

1. Build

```bash
sam build
```

2. Deploy

```bash
sam deploy --guided
```

3. Run execution (sync)

```bash
aws stepfunctions start-sync-execution \
  --state-machine-arn <ARN> \
  --input file://input.json \
  --region ap-south-1
```

🧪 Sample Input

```json
{
  "input_files": [
    {
      "bucket": "mapreduce-lambda-raw-mumbai",
      "key": "events/2025/12/01/file00.jsonl"
    },
    {
      "bucket": "mapreduce-lambda-raw-mumbai",
      "key": "events/2025/12/01/file08.jsonl"
    }
  ]
}
```
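
An input payload like the one above can also be assembled programmatically. The sketch below is a hypothetical helper (only the bucket name and key layout are taken from the sample), not part of the project:

```python
import json

def build_input(bucket: str, keys):
    """Assemble the Step Functions input event from a list of S3 keys."""
    return {"input_files": [{"bucket": bucket, "key": k} for k in keys]}

event = build_input(
    "mapreduce-lambda-raw-mumbai",
    ["events/2025/12/01/file00.jsonl", "events/2025/12/01/file08.jsonl"],
)
payload = json.dumps(event, indent=2)  # write this to input.json for the CLI
```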

Output

```json
[
  {
    "output_bucket": "mapreduce-lambda-output-mumbai",
    "output_key": "reduce-output/bucket_6.json"
  },
  {
    "output_bucket": "mapreduce-lambda-output-mumbai",
    "output_key": "reduce-output/bucket_12.json"
  },
  {
    "output_bucket": "mapreduce-lambda-output-mumbai",
    "output_key": "reduce-output/bucket_21.json"
  },
  {
    "output_bucket": "mapreduce-lambda-output-mumbai",
    "output_key": "reduce-output/bucket_23.json"
  },
  {
    "output_bucket": "mapreduce-lambda-output-mumbai",
    "output_key": "reduce-output/bucket_28.json"
  }
]
```

🧪 Testing

Includes:

- Unit tests (pure logic)
- Integration tests (mapper + reducer)
- Manual AWS tests for S3 batch writer

Mocking ensures no AWS calls during local tests.
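
A sketch of what such mocking can look like using the standard library's `unittest.mock`; the `count_lines` helper and its call pattern are hypothetical illustrations, not the project's actual test code:

```python
from unittest.mock import MagicMock

def count_lines(s3_client, bucket: str, key: str) -> int:
    """Toy stand-in for a handler helper that reads one object from S3."""
    body = s3_client.get_object(Bucket=bucket, Key=key)["Body"]
    return sum(1 for line in body.read().splitlines() if line.strip())

# Replace the real boto3 client with a mock so no AWS call is ever made
mock_s3 = MagicMock()
mock_s3.get_object.return_value = {
    "Body": MagicMock(read=lambda: b'{"url": "a"}\n{"url": "b"}\n')
}

n = count_lines(mock_s3, "raw-bucket", "events/file00.jsonl")
mock_s3.get_object.assert_called_once_with(Bucket="raw-bucket", Key="events/file00.jsonl")
```

Injecting the client as a parameter (rather than constructing it inside the function) is what makes this substitution trivial in tests.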

✅ Tests in Action

Run all tests:

```bash
uv run pytest -v tests
```

Test Output:

```
========================= test session starts =========================
collected 14 items

tests/integration/test_mapper_integration.py::test_mapper_lambda_integration PASSED [  7%]
tests/integration/test_reducer_integration.py::test_reducer_lambda_integration PASSED [ 14%]
tests/unit/test_group_mapper_outputs.py::test_extract_bucket_id_valid PASSED        [ 21%]
tests/unit/test_group_mapper_outputs.py::test_extract_bucket_id_invalid PASSED      [ 28%]
tests/unit/test_group_mapper_outputs.py::test_get_bucket_keys_map_basic PASSED      [ 35%]
tests/unit/test_group_mapper_outputs.py::test_get_nonempty_buckets_and_keys PASSED  [ 42%]
tests/unit/test_group_mapper_outputs.py::test_lambda_handler_full_flow PASSED       [ 50%]
tests/unit/test_group_mapper_outputs.py::test_lambda_handler_empty_results PASSED   [ 57%]
tests/unit/test_group_mapper_outputs.py::test_lambda_handler_missing_results_key PASSED [ 64%]
tests/unit/test_group_mapper_outputs.py::test_lambda_handler_inner_empty_outputs PASSED [ 71%]
tests/unit/test_mapper_hashing.py::test_simple_bucket_hash PASSED                   [ 78%]
tests/unit/test_mapper_helpers.py::test_populate_partition_counts_basic PASSED      [ 85%]
tests/unit/test_mapper_helpers.py::test_write_partition_counts_to_s3_basic PASSED   [ 92%]
tests/unit/test_reducer_helpers.py::test_reduce_data_for_bucket_basic PASSED        [100%]

========================= 14 passed in 7.27s =========================
```

💡 Future Enhancements

- Support more scalable JSONL processing using S3 Select (server-side filtering)
- Add CloudWatch structured logging + X-Ray tracing
- Use S3 Inventory for scalable object discovery at bucket scale

🏷️ License

MIT License: free to use, modify, and share.
