
@derekmeegan
Owner

feat: Implement Async API for Scraper with Job Tracking

Description:

This PR replaces direct Lambda invocation with an asynchronous REST API built on API Gateway, Lambda, and DynamoDB. Users submit a scraping job, receive an immediate acknowledgement, and poll for the result later, which suits longer-running tasks and simplifies integration.

Key Changes:

  1. Architecture:

    • Introduced API Gateway with two main endpoints:
      • POST /scrape: Accepts job submissions (URL, required jobId), triggers the scraper Lambda asynchronously (non-proxy integration), and returns 202 Accepted. Request body is validated.
      • GET /scrape/{jobId}: Allows polling for job status and results by jobId (sync, proxy integration). Path parameter is validated.
    • Added a DynamoDB table (JobStatusTable) to store job metadata, status (PENDING, RUNNING, SUCCESS, FAILED), results, and timestamps.
    • Split Lambda functionality into two distinct functions:
      • scraper: Handles the core Playwright/Browserbase logic and updates DynamoDB.
      • getter: Handles reading job status/results from DynamoDB for the GET endpoint.
    • API Key authentication is enforced on both endpoints via a Usage Plan.
    • Enabled detailed Access and Execution logging for API Gateway to CloudWatch.
  2. Project Structure:

    • Renamed src/ to lambdas/.
    • Created separate subdirectories (lambdas/scraper/, lambdas/getter/) containing the code, Dockerfile, and requirements.txt for each Lambda function.
    • Added an /examples directory containing a quick_start.py client script and its requirements.txt.
  3. CDK Stack (infra/stack.py):

    • Defined resources for DynamoDB table, API Gateway (RestApi, Resources, Methods, Models, Validators, Integrations, ApiKey, UsagePlan, Logging), and both Lambda functions with appropriate roles and permissions (including DynamoDB access).
    • Configured integrations for async non-proxy (POST) and sync proxy (GET).
    • Updated Lambda asset paths and environment variables.
  4. Lambda Functions:

    • Scraper: Adapted to handle non-proxy event input, parse payload, interact with DynamoDB for status updates/results, and manage errors.
    • Getter: New function to query DynamoDB by jobId from path parameters and return results, handling Decimal types for JSON serialization.
  5. Example Client (examples/quick_start.py):

    • Added a Python script demonstrating the new workflow: submitting a job with a client-generated jobId and polling the GET endpoint for completion. Reads API endpoint/key from environment variables.
  6. Documentation (README.md):

    • Significantly updated to reflect the new asynchronous architecture, API usage (POST/GET examples), project structure, and setup instructions (including setting environment variables for the example).
  7. CI/CD (.github/workflows/deploy.yaml):

    • Added staging branch to workflow triggers.
  8. .gitignore:

    • Added .env* and *.ipynb to ignore environment files and notebooks.
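The Decimal handling mentioned for the Getter is needed because boto3 returns DynamoDB numbers as `decimal.Decimal`, which `json.dumps` cannot serialize. A minimal sketch of one way to handle it follows; the helper names, the item field names other than `jobId`, and the 404 response shape are assumptions for illustration, not the code in this PR.

```python
import json
from decimal import Decimal


def to_jsonable(value):
    """Recursively replace Decimal values with int or float so json.dumps accepts them."""
    if isinstance(value, Decimal):
        # Whole numbers become int, everything else becomes float.
        return int(value) if value == value.to_integral_value() else float(value)
    if isinstance(value, dict):
        return {key: to_jsonable(item) for key, item in value.items()}
    if isinstance(value, (list, tuple)):
        return [to_jsonable(item) for item in value]
    return value


def build_response(item):
    """Build a proxy-integration style response from a DynamoDB item (hypothetical shape)."""
    if item is None:
        # Assumed behavior: an unknown jobId yields a 404.
        return {"statusCode": 404, "body": json.dumps({"message": "Job not found"})}
    return {
        "statusCode": 200,
        "headers": {"Content-Type": "application/json"},
        "body": json.dumps(to_jsonable(item)),
    }
```

In a real handler, `item` would come from a DynamoDB lookup keyed by the `jobId` path parameter.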

Impact:

  • This is a breaking change in how the service is invoked. Direct Lambda invocation is replaced by the API Gateway endpoints.
  • Job submission is now decoupled from job execution, which makes the service easier to integrate with webhooks or a UI.
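For callers migrating from direct invocation, a job submission might be built roughly as follows. This is a sketch under assumptions: the body field name `url`, the helper name, and the idea that `API_ENDPOINT_URL` holds the base URL without the `/scrape` path are not confirmed by this PR. The `x-api-key` header is the standard way to pass an API Gateway API key.

```python
import json
import urllib.request
import uuid


def build_submit_request(endpoint, api_key, target_url, job_id=None):
    """Build (but do not send) a POST /scrape request with a client-generated jobId."""
    job_id = job_id or str(uuid.uuid4())
    # Field names are assumptions; the PR only says the body carries a URL and a jobId.
    payload = {"url": target_url, "jobId": job_id}
    request = urllib.request.Request(
        url=endpoint.rstrip("/") + "/scrape",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json", "x-api-key": api_key},
        method="POST",
    )
    return job_id, request
```

Passing the returned request to `urllib.request.urlopen` would send it; a successful submission is expected to return 202 Accepted, after which the same `jobId` is used for polling.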

Testing:

  1. Deploy the stack via CDK (cdk deploy --all).
  2. Retrieve the API endpoint URL and API Key value (e.g., using the AWS console or CLI commands provided in the updated README).
  3. Set the API_ENDPOINT_URL and API_KEY environment variables.
  4. Run the example script: pip install -r examples/requirements.txt && python examples/quick_start.py.
  5. Verify successful job submission (202), polling, and eventual retrieval of results (SUCCESS status with data) or failure (FAILED status with error message) from DynamoDB.
  6. Check CloudWatch logs for both Lambdas and API Gateway for any errors.
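The polling in steps 4 and 5 can be sketched as a loop that stops once the job reaches SUCCESS or FAILED. This is not the contents of `examples/quick_start.py`; the `status` field name, the function names, and the interval and attempt limits are assumptions. The fetch function is injected so the loop can be exercised without a deployed stack.

```python
import time

# Terminal states taken from the status values listed in this PR.
TERMINAL_STATUSES = {"SUCCESS", "FAILED"}


def poll_job(fetch_status, job_id, interval_seconds=2.0, max_attempts=30, sleep=time.sleep):
    """Call fetch_status(job_id) until the job is terminal or attempts run out.

    fetch_status is a hypothetical callable that performs GET /scrape/{jobId}
    and returns the parsed JSON body as a dict.
    """
    for attempt in range(max_attempts):
        job = fetch_status(job_id)
        if job.get("status") in TERMINAL_STATUSES:
            return job
        # Wait between attempts, but not after the final one.
        if attempt < max_attempts - 1:
            sleep(interval_seconds)
    raise TimeoutError(f"Job {job_id} did not finish after {max_attempts} attempts")
```

A FAILED job is returned rather than raised, so the caller can inspect the stored error message.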

@derekmeegan merged commit eab82c0 into main on Apr 30, 2025
2 checks passed