Retry transient 5xx errors in step handler and queue operations #1011
base: main
Conversation
🦋 Changeset detected. Latest commit: 4dd3ee5. The changes in this PR will be included in the next version bump. This PR includes changesets to release 15 packages.
📊 Benchmark Results

The benchmark bot ran each scenario (workflow with no steps; 1, 10, 25, and 50 sequential steps; Promise.all and Promise.race with 10, 25, and 50 concurrent steps; and a stream workflow with TTFB metrics) in both Local Development and Production (Vercel), with observability links for Next.js (Turbopack), Express, and Nitro. Summary tables report the fastest framework by world and the fastest world by framework, with the winner determined by most benchmark wins; the detailed numbers are collapsed here.
🧪 E2E Test Results
❌ Some tests failed

Failed tests: 🌍 Community Worlds (42 failed), all in turso (42 failed)

Details by category:
- ✅ ▲ Vercel Production
- ✅ 💻 Local Development
- ✅ 📦 Local Production
- ✅ 🐘 Local Postgres
- ✅ 🪟 Windows
- ❌ 🌍 Community Worlds
- ✅ 📋 Other
Pull request overview
This PR extends workflow-server transient error handling into the step handler by introducing a 5xx retry helper and applying it to step lifecycle event writes, aiming to avoid consuming step attempts on transient infrastructure failures.
Changes:
- Add a `withServerErrorRetry` helper to retry workflow-server 5xx errors with exponential backoff.
- Wrap the step handler's `world.events.create` calls for `step_started`, `step_completed`, `step_failed`, and `step_retrying` with the retry helper.
- Add a 5xx "bubble to queue retry" path in the step execution error handling block.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| packages/core/src/runtime/step-handler.ts | Wrap step lifecycle event creation with 5xx retry; add logic to rethrow persistent 5xx to defer to queue retry. |
| packages/core/src/runtime/helpers.ts | Introduce `withServerErrorRetry` with 3 retries and exponential backoff for 5xx `WorkflowAPIError`s. |
| .changeset/retry-5xx-step-handler.md | Patch changeset describing the new retry behavior in the step handler. |
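For readers following along, here is a minimal, self-contained sketch of what a helper with these semantics could look like. The `WorkflowAPIError` stand-in class and the exact control flow are assumptions; the real implementation in `packages/core/src/runtime/helpers.ts` may differ.

```ts
// Minimal sketch of a 5xx retry helper with the semantics described above.
// WorkflowAPIError here is a stand-in; the real class lives in @workflow/core
// and its constructor/shape may differ.
class WorkflowAPIError extends Error {
  constructor(message: string, public status?: number) {
    super(message);
  }
}

const sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms));

export async function withServerErrorRetry<T>(fn: () => Promise<T>): Promise<T> {
  // 500ms, 1s, 2s backoff (~3.5s total), per the changeset description.
  const delays = [500, 1000, 2000];
  for (let attempt = 0; ; attempt++) {
    try {
      return await fn();
    } catch (err) {
      const retryable =
        err instanceof WorkflowAPIError &&
        err.status !== undefined &&
        err.status >= 500;
      // Non-5xx errors (including 429s, handled elsewhere by withThrottleRetry)
      // and exhausted retries propagate unchanged.
      if (!retryable || attempt >= delays.length) throw err;
      await sleep(delays[attempt]);
    }
  }
}
```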
```ts
const startResult = await withServerErrorRetry(() =>
  world.events.create(workflowRunId, {
    eventType: 'step_started',
    specVersion: SPEC_VERSION_CURRENT,
    correlationId: stepId,
  })
);
```
Copilot
AI
Feb 11, 2026
This adds new retry behavior that changes how the step handler reacts to workflow-server 5xx responses, but there are no unit tests covering the new helper’s retry/backoff semantics or the step-handler’s 5xx fast-path (throwing to queue vs. emitting step_retrying). Adding vitest coverage (similar to runtime/start.test.ts) would help prevent regressions in retry counts/delays and in when attempts are consumed.
Added unit tests in helpers.test.ts covering: success passthrough, retry on 5xx with recovery, exponential backoff across 3 retries, retry exhaustion, and non-retryable error passthrough (non-5xx, non-WorkflowAPIError, 429).
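As a rough illustration (not the actual test file), one of those recovery cases could be shaped roughly like this; the import paths and the `WorkflowAPIError` constructor signature are assumptions.

```ts
import { describe, expect, it, vi } from 'vitest';
// Import path and WorkflowAPIError constructor are assumptions for
// illustration; the real helpers.test.ts may differ.
import { withServerErrorRetry, WorkflowAPIError } from './helpers';

describe('withServerErrorRetry', () => {
  it('retries a 503 once and then succeeds', async () => {
    vi.useFakeTimers();
    const fn = vi
      .fn()
      .mockRejectedValueOnce(new WorkflowAPIError('unavailable', 503))
      .mockResolvedValueOnce('ok');

    const pending = withServerErrorRetry(fn);
    // First backoff step is 500ms per the changeset; advance fake timers past it.
    await vi.advanceTimersByTimeAsync(500);

    await expect(pending).resolves.toBe('ok');
    expect(fn).toHaveBeenCalledTimes(2);
    vi.useRealTimers();
  });
});
```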
Add `withServerErrorRetry` helper that retries world calls on 5xx errors with exponential backoff (500ms, 1s, 2s ≈ 3.5s total). Applied to all `world.events.create` calls in the step handler so transient workflow-server errors don't consume step attempts. If retries are exhausted, the error is thrown to the queue for higher-level retry without burning a step attempt. Co-Authored-By: Claude Opus 4.6 <[email protected]>
- Fix misleading `maxAttempts` log field to `maxRetries` in withServerErrorRetry
- Update step-handler comment to accurately note that queue retries may still consume step attempts since step_started has already incremented
- Add unit tests for withServerErrorRetry (7 tests covering success, retry/backoff, exhaustion, and non-5xx passthrough)

Co-Authored-By: Claude Opus 4.6 <[email protected]>
Force-pushed from fffe061 to e3ef703
Unit tests for withThrottleRetry and withServerErrorRetry helpers, plus an e2e test that exercises the 5xx retry codepath during step execution via run-scoped fault injection. Co-Authored-By: Claude Opus 4.6 <[email protected]>
VaguelySerious left a comment
LGTM with some concerns about what "transient" means for 500s and forever-retrying workflows
```ts
export async function withServerErrorRetry<T>(
  fn: () => Promise<T>
): Promise<T> {
  const delays = [500, 1000, 2000];
```
I'd like more backoff here. 500s might be transient, but in our history, most 500s were fixed after 5-60 minutes, not a few seconds. The only transient 500s I remember are DynamoDB throttling, which we should be returning 429s for, but I guess this is safe since we're only doing three retries.
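One way to address this concern would be to make the schedule a parameter. This is a hypothetical variant, not what the PR implements; `isRetryable5xx` stands in for the real `WorkflowAPIError` status check.

```ts
// Hypothetical variant: a configurable backoff schedule, so call sites hitting
// endpoints whose 5xx outages last minutes (rather than seconds) could opt into
// longer waits. Not part of this PR.
const isRetryable5xx = (err: unknown): boolean => {
  const status = (err as { status?: unknown })?.status;
  return typeof status === 'number' && status >= 500;
};

export async function withServerErrorRetryUsing<T>(
  fn: () => Promise<T>,
  delays: readonly number[] = [500, 1000, 2000]
): Promise<T> {
  for (let attempt = 0; ; attempt++) {
    try {
      return await fn();
    } catch (err) {
      if (!isRetryable5xx(err) || attempt >= delays.length) throw err;
      await new Promise((resolve) => setTimeout(resolve, delays[attempt]));
    }
  }
}

// Example: a much longer schedule for operations where 500s historically
// resolved over minutes rather than seconds.
// await withServerErrorRetryUsing(fn, [60_000, 300_000, 900_000]);
```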
```ts
// subsequent queue retry may increment attempts again depending on
// storage semantics, so these retries are not guaranteed to be
// "free" with respect to step attempts.
if (err.status !== undefined && err.status >= 500) {
```
If this doesn't consume a re-try, we will forever-retry specific endpoints that throw consistent 500s 🤔
HH: not forever. only 64 times I think (the max queue deliveries). Maybe @ctgowrie has thoughts?
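To make the semantics being debated concrete, here is a hedged sketch of the failure branch; the names (`world`, the event payload fields, the import path) come from the PR description and are assumptions about the actual step handler in `packages/core/src/runtime/step-handler.ts`.

```ts
// Hedged sketch of the failure branch discussed above; the real step handler
// is more involved. Import path and event payload shape are assumptions.
import { WorkflowAPIError, withServerErrorRetry } from './helpers';

interface EventsWorld {
  events: { create: (runId: string, event: Record<string, unknown>) => Promise<unknown> };
}

export async function handleStepError(
  err: unknown,
  world: EventsWorld,
  workflowRunId: string,
  stepId: string
): Promise<void> {
  const persistent5xx =
    err instanceof WorkflowAPIError && err.status !== undefined && err.status >= 500;
  if (persistent5xx) {
    // withServerErrorRetry already gave up: rethrow so the queue redelivers the
    // message. Per the review thread, queue deliveries are capped (around 64),
    // so this does not retry forever; but because step_started was already
    // written, the redelivery is not guaranteed to be "free" attempt-wise.
    throw err;
  }
  // Ordinary step failure: record step_retrying (itself wrapped in the 5xx
  // retry helper) and let normal attempt accounting apply.
  await withServerErrorRetry(() =>
    world.events.create(workflowRunId, {
      eventType: 'step_retrying',
      correlationId: stepId,
    })
  );
}
```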
```ts
  };
}
// Wrap VQS server errors as WorkflowAPIError so withServerErrorRetry can catch them
if (error instanceof InternalServerError) {
```
Would prefer we let the queue client handle this so we can control it from the server.
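For concreteness, a sketch of the wrapping pattern being discussed; the `InternalServerError` and `WorkflowAPIError` classes below are local stand-ins for the real queue-client and @workflow/core types, and the status code is an assumption.

```ts
// Local stand-ins: the real InternalServerError comes from the queue client and
// WorkflowAPIError from @workflow/core; their shapes here are assumptions.
class InternalServerError extends Error {}

class WorkflowAPIError extends Error {
  constructor(message: string, public status?: number) {
    super(message);
  }
}

// Sketch of the normalization at the world-vercel boundary: re-throw queue
// server errors as WorkflowAPIError with a 5xx status so withServerErrorRetry
// (which only inspects WorkflowAPIError.status) treats them as retryable.
export function rethrowAsWorkflowAPIError(error: unknown): never {
  if (error instanceof InternalServerError) {
    throw new WorkflowAPIError(error.message, 500);
  }
  throw error;
}
```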
Moving to draft until I remove the queue retrying and rebase this on main to resolve merge conflicts.
Human (hey it's me Pranay)
In addition to retrying 5xx errors, I added an e2e test here for steps. For AI and human reviewers: please pay careful attention to the e2e test to validate that it actually works, tests the right thing, and isn't a possible false positive.
This PR also wraps queue operations with retrying (similar to 5xx errors from workflow-server: if the queue experiences transient network errors, we should have some retrying). It may be better for this to live inside the `@vercel/queue` client natively, but I added it here anyway. Read on:
AI
Summary
Step handler 5xx retry (`@workflow/core`):
- `withServerErrorRetry` helper that retries on 5xx `WorkflowAPIError` with exponential backoff (500ms, 1s, 2s; ~3.5s total)
- All `world.events.create` calls in the step handler (`step_started`, `step_completed`, `step_failed`, `step_retrying`) are wrapped with the retry helper
- If retries are exhausted, the error is thrown to the queue for higher-level retry rather than emitting `step_retrying`, so no step attempt is consumed

Queue operation retry (`@workflow/core` + `@workflow/world-vercel`):
- VQS server errors (`InternalServerError`, `ConsumerDiscoveryError`, `ConsumerRegistryNotConfiguredError`) are wrapped as `WorkflowAPIError` with status 500/502/503 at the `world-vercel` boundary, matching how `utils.ts:makeRequest()` already normalizes HTTP errors for events
- `world.queue()` in `queueMessage()` is wrapped with `withServerErrorRetry`, giving all call sites (step skip, max retries re-queue, step completion continuation, error handler continuation) automatic retry protection for transient queue failures
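A hedged sketch of the `queueMessage()` wrapping described above; the `world.queue()` signature shown and the import path are assumptions.

```ts
// Hedged sketch: wrap the queue call in the same 5xx retry helper so transient
// queue failures are retried before the error propagates to the caller.
// The QueueWorld shape and the import path are assumptions.
import { withServerErrorRetry } from './helpers';

interface QueueWorld {
  queue: (queueName: string, payload: unknown) => Promise<void>;
}

export async function queueMessage(
  world: QueueWorld,
  queueName: string,
  payload: unknown
): Promise<void> {
  // Transient 5xx-style failures (after world-vercel normalizes queue-client
  // errors into WorkflowAPIError) get the 500ms/1s/2s backoff; exhausted
  // retries propagate so higher-level delivery retry can take over.
  await withServerErrorRetry(() => world.queue(queueName, payload));
}
```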
Tests
- Unit tests (`helpers.test.ts`): 16 tests covering `withServerErrorRetry` (7 tests) and `withThrottleRetry` (9 tests)
- Unit tests (`queue.test.ts`): 3 tests verifying VQS errors are wrapped as `WorkflowAPIError` with correct status codes
- E2E test (`e2e.test.ts`): `serverError5xxRetryWorkflow` uses run-scoped fault injection to make `step_completed` calls throw 500 errors, then verifies the workflow completes correctly, retries actually fired, and no step attempt was consumed

Test plan
- `pnpm build` passes
- `pnpm test` in `packages/core` passes
- `pnpm test` in `packages/world-vercel` passes (27 tests including 3 new VQS error wrapping tests)
- `pnpm vitest run packages/core/src/runtime/helpers.test.ts`: 16 tests pass
- `serverError5xxRetryWorkflow` passes locally against the nextjs-turbopack dev server

🤖 Generated with Claude Code