Upgrade API to be async and provide gateway endpoint for other services #1
feat: Implement Async API for Scraper with Job Tracking
Description:
This PR refactors the project from a direct Lambda invocation model to a robust, asynchronous API pattern using API Gateway, Lambda, and DynamoDB. This allows users to submit scraping jobs via a REST API, receive an immediate acknowledgement, and poll for results later, making it suitable for longer-running tasks and easier integration.
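The submit-then-poll flow described above could look roughly like this from a client's point of view. This is a minimal sketch, not the code in examples/quick_start.py: the request body field names ("url", "jobId"), the "status" field in the response, and the x-api-key header are assumptions, and submit_and_poll and http_json are hypothetical helper names.

```python
import json
import time
import urllib.request

# Terminal statuses named in this PR; PENDING and RUNNING mean "keep polling".
TERMINAL_STATUSES = {"SUCCESS", "FAILED"}


def http_json(method, url, api_key, body=None):
    """Send a JSON request and return the decoded JSON response ({} if empty)."""
    data = json.dumps(body).encode("utf-8") if body is not None else None
    # Assumption: the API key is passed in the x-api-key header.
    headers = {"Content-Type": "application/json", "x-api-key": api_key}
    request = urllib.request.Request(url, data=data, headers=headers, method=method)
    with urllib.request.urlopen(request) as response:
        raw = response.read()
        return json.loads(raw.decode("utf-8")) if raw else {}


def submit_and_poll(base_url, api_key, target_url, job_id,
                    interval=2.0, max_attempts=30, send=http_json):
    """Submit a scrape job, then poll GET /scrape/{jobId} until it finishes."""
    base = base_url.rstrip("/")
    # Assumed body shape: the PR only says the POST takes a URL and a required jobId.
    send("POST", f"{base}/scrape", api_key, {"url": target_url, "jobId": job_id})
    for _ in range(max_attempts):
        job = send("GET", f"{base}/scrape/{job_id}", api_key)
        if job.get("status") in TERMINAL_STATUSES:
            return job
        time.sleep(interval)
    raise TimeoutError(f"job {job_id} did not finish after {max_attempts} polls")
```

The send parameter exists only so the polling loop can be exercised without a deployed stack. Against a real deployment, base_url and api_key would come from the API_ENDPOINT_URL and API_KEY environment variables.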
Key Changes:
Architecture:
- POST /scrape: Accepts job submissions (URL, required jobId), triggers the scraper Lambda asynchronously (non-proxy integration), and returns 202 Accepted. The request body is validated.
- GET /scrape/{jobId}: Allows polling for job status and results by jobId (sync, proxy integration). The path parameter is validated.
- DynamoDB table (JobStatusTable) to store job metadata, status (PENDING, RUNNING, SUCCESS, FAILED), results, and timestamps.
- scraper: Handles the core Playwright/Browserbase logic and updates DynamoDB.
- getter: Handles reading job status/results from DynamoDB for the GET endpoint.
Project Structure:
- Source moved from src/ to lambdas/.
- Per-function directories (lambdas/scraper/, lambdas/getter/) containing the code, Dockerfile, and requirements.txt for each Lambda function.
- /examples directory containing a quick_start.py client script and its requirements.txt.
CDK Stack (infra/stack.py):
- API Gateway integrations: async non-proxy (POST) and sync proxy (GET).
Lambda Functions:
- getter: extracts jobId from path parameters and returns results, handling Decimal types for JSON serialization.
Example Client (examples/quick_start.py):
- Submits a job with a jobId and polls the GET endpoint for completion. Reads API endpoint/key from environment variables.
Documentation (README.md):
- Usage (POST/GET examples), project structure, and setup instructions (including setting environment variables for the example).
CI/CD (.github/workflows/deploy.yaml):
- Added the staging branch to workflow triggers.
.gitignore:
- Added .env* and *.ipynb to ignore environment files and notebooks.
Impact:
Testing:
- Deploy the stack (cdk deploy --all).
- Set the API_ENDPOINT_URL and API_KEY environment variables.
- Run pip install -r examples/requirements.txt && python examples/quick_start.py.
- Verify that the script reports success (SUCCESS status with data) or failure (FAILED status with error message) from DynamoDB.
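For reviewers unfamiliar with the Decimal handling mentioned under Lambda Functions: boto3's resource API returns DynamoDB numbers as decimal.Decimal, which json.dumps rejects by default. One common way to handle this is a default hook, sketched below. This is an illustration under assumptions, not necessarily how the getter implements it; decimal_default and to_json are hypothetical names.

```python
import json
from decimal import Decimal


def decimal_default(value):
    """json.dumps fallback: Decimal becomes int when whole, otherwise float."""
    if isinstance(value, Decimal):
        return int(value) if value == value.to_integral_value() else float(value)
    # Anything else that json cannot encode is still an error.
    raise TypeError(f"Object of type {type(value).__name__} is not JSON serializable")


def to_json(item):
    """Serialize a DynamoDB item (possibly nested) to a JSON string."""
    return json.dumps(item, default=decimal_default)
```

Converting to float can lose precision for very large or very precise values. If that matters for stored results, serializing Decimal as a string is a safer alternative.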