Upgrade API to be async and provide gateway endpoint for other services #1

derekmeegan · 2025-04-30T16:40:39Z

feat: Implement Async API for Scraper with Job Tracking

Description:

This PR refactors the project from direct Lambda invocation to an asynchronous API built on API Gateway, Lambda, and DynamoDB. Users submit scraping jobs via a REST API, receive an immediate acknowledgement, and poll for results later, which suits longer-running tasks and simplifies integration with other services.

Key Changes:

Architecture:
- Introduced API Gateway with two main endpoints:
  - POST /scrape: Accepts job submissions (URL, required jobId), triggers the scraper Lambda asynchronously (non-proxy integration), and returns 202 Accepted. Request body is validated.
  - GET /scrape/{jobId}: Allows polling for job status and results by jobId (sync, proxy integration). Path parameter is validated.
- Added a DynamoDB table (JobStatusTable) to store job metadata, status (PENDING, RUNNING, SUCCESS, FAILED), results, and timestamps.
- Split Lambda functionality into two distinct functions:
  - scraper: Handles the core Playwright/Browserbase logic and updates DynamoDB.
  - getter: Handles reading job status/results from DynamoDB for the GET endpoint.
- API Key authentication is enforced on both endpoints via a Usage Plan.
- Enabled detailed Access and Execution logging for API Gateway to CloudWatch.
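
The job lifecycle recorded in JobStatusTable can be sketched roughly as below. This is an illustration under stated assumptions, not the code in this PR: the attribute names (url, createdAt, updatedAt, result, error), the transition table, and the plain dict standing in for DynamoDB are all hypothetical. Only the four status values and the jobId key come from the description above.

```python
from datetime import datetime, timezone

# Assumed transition rules; the PR only names the four statuses.
ALLOWED = {
    "PENDING": {"RUNNING", "FAILED"},
    "RUNNING": {"SUCCESS", "FAILED"},
    "SUCCESS": set(),
    "FAILED": set(),
}


def _now():
    """Current UTC time as an ISO 8601 string."""
    return datetime.now(timezone.utc).isoformat()


def new_job(table, job_id, url):
    """Create a PENDING record; `table` is a dict standing in for DynamoDB."""
    if job_id in table:
        raise ValueError(f"job {job_id!r} already exists")
    now = _now()
    table[job_id] = {
        "jobId": job_id,
        "url": url,
        "status": "PENDING",
        "createdAt": now,
        "updatedAt": now,
    }
    return table[job_id]


def transition(table, job_id, status, **fields):
    """Move a job to `status`, attaching extra fields such as result or error."""
    item = table[job_id]
    if status not in ALLOWED[item["status"]]:
        raise ValueError(f"cannot go from {item['status']} to {status}")
    item.update(fields, status=status, updatedAt=_now())
    return item
```

In a real table the same guard could be expressed as a conditional write, so that a late or duplicate update cannot overwrite a terminal status.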
Project Structure:
- Renamed src/ to lambdas/.
- Created separate subdirectories (lambdas/scraper/, lambdas/getter/) containing the code, Dockerfile, and requirements.txt for each Lambda function.
- Added an /examples directory containing a quick_start.py client script and its requirements.txt.
CDK Stack (infra/stack.py):
- Defined resources for DynamoDB table, API Gateway (RestApi, Resources, Methods, Models, Validators, Integrations, ApiKey, UsagePlan, Logging), and both Lambda functions with appropriate roles and permissions (including DynamoDB access).
- Configured integrations for async non-proxy (POST) and sync proxy (GET).
- Updated Lambda asset paths and environment variables.
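
To make the request validation concrete, here is a hedged sketch of the kind of JSON schema an API Gateway Model might hold for POST /scrape, plus a small local check that mirrors it. The schema contents are an assumption (the description only says the body carries a URL and a required jobId); `SCRAPE_REQUEST_SCHEMA` and `validate_body` are hypothetical names, and in the deployed stack API Gateway performs this validation itself.

```python
# Hypothetical request model; API Gateway models use JSON Schema draft-04.
SCRAPE_REQUEST_SCHEMA = {
    "$schema": "http://json-schema.org/draft-04/schema#",
    "type": "object",
    "required": ["url", "jobId"],  # assumed: both fields required
    "properties": {
        "url": {"type": "string"},
        "jobId": {"type": "string"},
    },
}


def validate_body(body, schema=SCRAPE_REQUEST_SCHEMA):
    """Return a list of validation errors (empty if the body is acceptable).

    Only checks required fields and string types, which is all this
    illustrative schema needs; it is not a general JSON Schema validator.
    """
    if not isinstance(body, dict):
        return ["body must be a JSON object"]
    errors = []
    for name in schema["required"]:
        if name not in body:
            errors.append(f"missing required field: {name}")
    for name, spec in schema["properties"].items():
        if name in body and spec["type"] == "string" and not isinstance(body[name], str):
            errors.append(f"field {name} must be a string")
    return errors
```

A client could run a check like this before submitting, to catch a malformed body without a round trip.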
Lambda Functions:
- Scraper: Adapted to handle non-proxy event input, parse payload, interact with DynamoDB for status updates/results, and manage errors.
- Getter: New function to query DynamoDB by jobId from path parameters and return results, handling Decimal types for JSON serialization.
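
The Decimal handling matters because boto3's DynamoDB resource returns numeric attributes as decimal.Decimal, which json.dumps rejects by default. A minimal sketch of one way to handle it follows; `decimal_default` and `build_response` are hypothetical names, and the 404 body is an assumption rather than the getter's actual behavior.

```python
import json
from decimal import Decimal


def decimal_default(obj):
    """json.dumps `default` hook: turn Decimal into int or float."""
    if isinstance(obj, Decimal):
        # Whole numbers become int so that 2 is not serialized as 2.0.
        return int(obj) if obj == obj.to_integral_value() else float(obj)
    raise TypeError(f"Object of type {type(obj).__name__} is not JSON serializable")


def build_response(item):
    """Build a Lambda proxy-integration response for a job item (or None)."""
    if item is None:
        return {"statusCode": 404, "body": json.dumps({"message": "Job not found"})}
    return {
        "statusCode": 200,
        "headers": {"Content-Type": "application/json"},
        "body": json.dumps(item, default=decimal_default),
    }
```

Converting to float can lose precision for very large or very precise values; serializing such values as strings is a common alternative.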
Example Client (examples/quick_start.py):
- Added a Python script demonstrating the new workflow: submitting a job with a client-generated jobId and polling the GET endpoint for completion. Reads API endpoint/key from environment variables.
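
The polling half of that workflow might look roughly like the sketch below. It is not the contents of quick_start.py: the function names, the polling interval, the attempt limit, and the assumption that the GET response is a JSON object with a `status` field are all illustrative. The x-api-key header is the standard way to pass an API Gateway API key.

```python
import json
import time
import urllib.request

TERMINAL = {"SUCCESS", "FAILED"}


def http_get_json(url, api_key):
    """GET `url` with an x-api-key header and decode the JSON body."""
    req = urllib.request.Request(url, headers={"x-api-key": api_key})
    with urllib.request.urlopen(req, timeout=30) as resp:
        return json.load(resp)


def poll_job(base_url, api_key, job_id, fetch=http_get_json,
             interval=5.0, max_attempts=60, sleep=time.sleep):
    """Poll GET /scrape/{jobId} until the job reaches a terminal status.

    `fetch` and `sleep` are injectable so the loop can be exercised
    without network access or real delays.
    """
    url = f"{base_url.rstrip('/')}/scrape/{job_id}"
    for _ in range(max_attempts):
        job = fetch(url, api_key)
        if job.get("status") in TERMINAL:
            return job
        sleep(interval)
    raise TimeoutError(f"job {job_id} did not finish after {max_attempts} polls")
```

A caller would typically read the base URL and key from the API_ENDPOINT_URL and API_KEY environment variables and pass them in.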
Documentation (README.md):
- Significantly updated to reflect the new asynchronous architecture, API usage (POST/GET examples), project structure, and setup instructions (including setting environment variables for the example).
CI/CD (.github/workflows/deploy.yaml):
- Added staging branch to workflow triggers.
.gitignore:
- Added .env* and *.ipynb to ignore environment files and notebooks.

Impact:

- This is a breaking change in how the service is invoked: direct Lambda invocation is replaced by the API Gateway endpoints.
- The new design provides a scalable, decoupled architecture suitable for webhooks or UI integration.

Testing:

1. Deploy the stack via CDK (cdk deploy --all).
2. Retrieve the API endpoint URL and API Key value (e.g., using the AWS console or the CLI commands provided in the updated README).
3. Set the API_ENDPOINT_URL and API_KEY environment variables.
4. Run the example script: pip install -r examples/requirements.txt && python examples/quick_start.py.
5. Verify successful job submission (202), polling, and eventual retrieval of results (SUCCESS status with data) or failure (FAILED status with error message) from DynamoDB.
6. Check CloudWatch logs for both Lambdas and API Gateway for any errors.

derekmeegan and others added 11 commits April 30, 2025 10:44

initial async api design

d1f8e43

fix duplicate execution role alias

6dbfb30

add git ignore

129de52

fix scraper import

e97c887

fix partition key reference

d456f56

fix stack

f3da548

get async api setup right and activate logging for gateway

40d1c05

add quick start example

f442ce4

remove extra space :)

f9eebbe

update readme

ee1ec4c

Update README.md

b0cf3e2

derekmeegan merged commit eab82c0 into main Apr 30, 2025
2 checks passed
