A lightweight, fully serverless MapReduce-style data processing pipeline built on AWS Lambda, Step Functions (Express Workflows), and Amazon S3.
It processes large volumes of JSONL input files containing URLs and produces aggregated counts per unique URL, with no EMR, no EC2, and no servers to manage.
- Features
- Tech Stack
- Project Structure
- Architecture Overview
- Summary of Flow
- Quick Start
- Deployment Notes
- SAM Deployment Architectures
- Deploying the Modular Version
- Testing
- Future Enhancements
- License
This project demonstrates how to process large datasets without running out of memory, using:
- Zero manual compute
- Horizontal scaling via Step Functions Map state
- Fan-out Mapper -> centralized Grouping -> fan-out Reducer
- Hash-based bucketing (map phase)
- Bucket-level aggregation (reduce phase)
- Streaming reads from S3
- Memory-safe chunk processing
- Temporary + final S3 storage layers
- AWS SAM (both monolithic and modular nested stacks)
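The hash-based bucketing idea can be sketched as follows. This is a minimal illustration, not the project's actual mapper code: the real `NUM_BUCKETS` lives in `config/settings.py`, and the function name here is hypothetical.

```python
import hashlib

NUM_BUCKETS = 32  # illustrative value; the project reads this from config/settings.py


def bucket_for(url: str, num_buckets: int = NUM_BUCKETS) -> int:
    """Map a URL to a stable bucket ID so every mapper routes the
    same URL into the same reducer partition."""
    digest = hashlib.md5(url.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_buckets
```

A deterministic digest such as MD5 matters here: Python's built-in `hash()` is randomized per process, so it would assign the same URL to different buckets across Lambda invocations.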
| Component | Description |
|---|---|
| Python 3.13 | Core language |
| Amazon S3 | Storage layer |
| AWS Lambda | Compute layer |
| Step Functions | Workflow orchestration |
| AWS SAM | Infrastructure deployment |
| boto3 | AWS SDK |
| pytest | Testing framework |
| uv | Lightweight dependency tool |
```
src/mapreduce_lambda_aws/
│
├── mapper/
│   └── handler.py            # Mapper Lambda
│
├── grouping/
│   └── handler.py            # GroupOutputs Lambda (organizes reducer jobs)
│
├── reducer/
│   └── handler.py            # Reducer Lambda
│
├── utils/
│   ├── s3_io.py              # Streaming reads/writes
│   └── s3_batch_writer.py
│
└── config/
    └── settings.py           # NUM_BUCKETS, BUCKET_TEMP, BUCKET_OUTPUT, etc.

template.yaml                  # Monolithic SAM stack for initial testing

sam/                           # Production-ready and fully modular SAM stack
├── template.yaml              # Root modular SAM template
├── lambdas/                   # Individual Lambda stacks
└── statemachine/              # State machine with Mapper, Group, and Reducer Lambdas

aws_tests/
└── manual/                    # S3 content creation

tests/unit/
├── test_group_mapper_outputs.py
├── test_mapper_hashing.py
├── test_mapper_helpers.py
└── test_reducer_helpers.py

tests/integration/             # End-to-end tests for mapper and reducer
```
```
┌──────────────────────────┐
│   Raw Input S3 Bucket    │
│ (JSONL files: urls etc.) │
└────────────┬─────────────┘
             │
             ▼
   ┌───────────────────┐
   │  Step Functions   │
   │     MAP State     │
   │ (parallel invoke) │
   └─────────┬─────────┘
             │ each input file
             ▼
   ┌───────────────────┐
   │   Mapper Lambda   │
   │ - Reads JSONL     │
   │ - Buckets URLs    │
   │ - Writes partials │
   └─────────┬─────────┘
             │
             ▼
  ┌─────────────────────┐
  │ Temporary S3 Bucket │
  │   (partial counts)  │
  └──────────┬──────────┘
             │
             ▼
┌─────────────────────────────────────┐
│           Grouping Lambda           │
│ - group mapper outputs by bucket_id │
└──────────────────┬──────────────────┘
                   │
                   ▼
         ┌───────────────────┐
         │  Step Functions   │
         │     MAP State     │
         │ (parallel invoke) │
         └─────────┬─────────┘
                   │ each set of temp files for one bucket
                   ▼
          ┌──────────────────┐
          │  Reducer Lambda  │
          │ - Reads partials │
          │ - Merges counts  │
          │ - Writes final   │
          └────────┬─────────┘
                   ▼
        ┌─────────────────────┐
        │ Final Output Bucket │
        │    bucket_X.json    │
        └─────────────────────┘
```
- **Mapper Lambda** reads raw JSONL files in a streaming manner, hashes keys into buckets, and writes partial counts to S3.
- **Grouping Lambda** collects mapper outputs and groups them by bucket ID.
- **Reducer Lambda** processes each bucket independently and merges partial results.
- **Final Output**: per-bucket aggregates written to the output bucket.
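As a rough sketch of the grouping step, mapper output keys can be bucketed by the ID embedded in each key. The key layout `.../bucket_<id>/...` is an assumption for illustration, not taken from the real mapper:

```python
import re
from collections import defaultdict

# Hypothetical key layout: "mapper-output/bucket_12/part-000.json"
BUCKET_ID_RE = re.compile(r"bucket_(\d+)")


def group_keys_by_bucket(keys: list[str]) -> dict[int, list[str]]:
    """Group mapper output keys by the bucket ID embedded in each key,
    so each reducer invocation receives exactly one bucket's partials."""
    grouped: dict[int, list[str]] = defaultdict(list)
    for key in keys:
        match = BUCKET_ID_RE.search(key)
        if match:
            grouped[int(match.group(1))].append(key)
    return dict(grouped)
```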
Install dependencies:

```bash
uv sync
```

Run tests:

```bash
uv run pytest
```

The Lambda functions are intentionally:
- simple
- synchronous
- stateless
This keeps the workflow predictable and aligned with AWS best practices for high-throughput serverless MapReduce pipelines.
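For example, the core of the reduce step can stay a pure, stateless merge. This is a sketch with a hypothetical function name; the actual handler also streams the partials in from S3:

```python
from collections import Counter


def merge_partial_counts(partials: list[dict[str, int]]) -> dict[str, int]:
    """Merge per-mapper partial counts for one bucket into final totals.
    No state survives between invocations; the inputs fully determine the output."""
    totals: Counter = Counter()
    for partial in partials:
        totals.update(partial)
    return dict(totals)
```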
Monolithic stack:

- single template.yaml
- ideal for prototyping

Modular stack:

- nested stacks
- decoupled compute + workflow layers
- clean CI/CD

The modular version is the recommended deployment model.
```bash
sam build
sam deploy --guided
```

Start a synchronous execution with an input file:

```bash
aws stepfunctions start-sync-execution \
  --state-machine-arn <ARN> \
  --input file://input.json \
  --region ap-south-1
```

Example `input.json`:
```json
{
  "input_files": [
    {
      "bucket": "mapreduce-lambda-raw-mumbai",
      "key": "events/2025/12/01/file00.jsonl"
    },
    {
      "bucket": "mapreduce-lambda-raw-mumbai",
      "key": "events/2025/12/01/file08.jsonl"
    }
  ]
}
```

Example output:

```json
[
  {
    "output_bucket": "mapreduce-lambda-output-mumbai",
    "output_key": "reduce-output/bucket_6.json"
  },
  {
    "output_bucket": "mapreduce-lambda-output-mumbai",
    "output_key": "reduce-output/bucket_12.json"
  },
  {
    "output_bucket": "mapreduce-lambda-output-mumbai",
    "output_key": "reduce-output/bucket_21.json"
  },
  {
    "output_bucket": "mapreduce-lambda-output-mumbai",
    "output_key": "reduce-output/bucket_23.json"
  },
  {
    "output_bucket": "mapreduce-lambda-output-mumbai",
    "output_key": "reduce-output/bucket_28.json"
  }
]
```

Includes:
- Unit tests (pure logic)
- Integration tests (mapper + reducer)
- Manual AWS tests for S3 batch writer
Mocking ensures no AWS calls during local tests.
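A minimal sketch of the mocking approach: the S3 client is dependency-injected and replaced by a `MagicMock`, so the code under test never reaches AWS. The helper name and key layout here are hypothetical; the repository's actual fixtures may differ.

```python
from unittest.mock import MagicMock


def put_partial(s3_client, bucket: str, key: str, body: bytes) -> None:
    """Hypothetical helper: writes one partial-count object via the injected client."""
    s3_client.put_object(Bucket=bucket, Key=key, Body=body)


def test_put_partial_never_touches_aws():
    fake_s3 = MagicMock()  # stands in for boto3.client("s3")
    put_partial(fake_s3, "temp-bucket", "bucket_0/part-000.json", b"{}")
    fake_s3.put_object.assert_called_once_with(
        Bucket="temp-bucket", Key="bucket_0/part-000.json", Body=b"{}"
    )
```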
Run all tests:

```bash
uv run pytest -v tests
```

Test output:

```text
========================= test session starts =========================
collected 14 items

tests/integration/test_mapper_integration.py::test_mapper_lambda_integration PASSED [  7%]
tests/integration/test_reducer_integration.py::test_reducer_lambda_integration PASSED [ 14%]
tests/unit/test_group_mapper_outputs.py::test_extract_bucket_id_valid PASSED [ 21%]
tests/unit/test_group_mapper_outputs.py::test_extract_bucket_id_invalid PASSED [ 28%]
tests/unit/test_group_mapper_outputs.py::test_get_bucket_keys_map_basic PASSED [ 35%]
tests/unit/test_group_mapper_outputs.py::test_get_nonempty_buckets_and_keys PASSED [ 42%]
tests/unit/test_group_mapper_outputs.py::test_lambda_handler_full_flow PASSED [ 50%]
tests/unit/test_group_mapper_outputs.py::test_lambda_handler_empty_results PASSED [ 57%]
tests/unit/test_group_mapper_outputs.py::test_lambda_handler_missing_results_key PASSED [ 64%]
tests/unit/test_group_mapper_outputs.py::test_lambda_handler_inner_empty_outputs PASSED [ 71%]
tests/unit/test_mapper_hashing.py::test_simple_bucket_hash PASSED [ 78%]
tests/unit/test_mapper_helpers.py::test_populate_partition_counts_basic PASSED [ 85%]
tests/unit/test_mapper_helpers.py::test_write_partition_counts_to_s3_basic PASSED [ 92%]
tests/unit/test_reducer_helpers.py::test_reduce_data_for_bucket_basic PASSED [100%]
========================= 14 passed in 7.27s =========================
```
- Support more scalable JSONL processing using S3 Select (server-side filtering)
- Add CloudWatch structured logging + X-Ray tracing
- Use S3 Inventory for scalable object discovery at bucket scale
MIT License: free to use, modify, and share.