Skip to content

fix(rebalancer): isolate per-adapter failures in getPendingRebalances#3437

Open
droplet-rl wants to merge 2 commits into
masterfrom
droplet/T90K0AL22-C03GHT4RV42-1780130120-316509
Open

fix(rebalancer): isolate per-adapter failures in getPendingRebalances#3437
droplet-rl wants to merge 2 commits into
masterfrom
droplet/T90K0AL22-C03GHT4RV42-1780130120-316509

Conversation

@droplet-rl
Copy link
Copy Markdown
Contributor

Summary

A transient outage in one adapter's read API used to propagate up through BaseRebalancerClient.getPendingRebalances -> InventoryClient.update and crash the relayer process. This change catches per-adapter errors, logs them at error level with the adapter name, and continues aggregating from the others so the relayer keeps filling on chains served by healthy adapters and operates on a partial view of pending rebalances for that cycle.

Motivation

We just saw zion-across-relayer-primary exit with:

HttpRequestError: 502 Bad Gateway - nginx/1.22.1
    at HttpTransport.request (@nktkas/hyperliquid/.../http/mod.js:109)
    at getUserNonFundingLedgerUpdates (src/utils/HyperliquidUtils.ts)
    at HyperliquidStablecoinSwapAdapter._getInitiatedWithdrawalsFromHypercore (src/rebalancer/adapters/hyperliquid.ts:1006)
    at HyperliquidStablecoinSwapAdapter._getUnfinalizedWithdrawalAmountFromHypercore (src/rebalancer/adapters/hyperliquid.ts:1040)
    at HyperliquidStablecoinSwapAdapter.getPendingRebalances (src/rebalancer/adapters/hyperliquid.ts:550)
    at BaseRebalancerClient.getPendingRebalances (src/rebalancer/clients/BaseRebalancerClient.ts:56)

The call chain bottoms out in InventoryClient.update(), which has no catch around the rebalancer-client call. A transient HL read-API outage gates every fill the relayer would otherwise do on Ethereum, Arbitrum, Base, etc. That's the root problem: treating one venue's read API as a hard precondition for the whole relay loop.

Change

BaseRebalancerClient.getPendingRebalances now wraps each adapter.getPendingRebalances(account) in try/catch:

  • on success: same aggregation as before;
  • on failure: log at error (at: "BaseRebalancerClient#getPendingRebalances", includes the adapter name and stringified error), skip that adapter for the cycle.

The forEachAsync loop was switched from Object.values(this.adapters) to Object.entries(this.adapters) so the adapter name is available for logging.

Behavioral effect

The exact 502 from HyperliquidStablecoinSwapAdapter.getPendingRebalances that crashed the bot now logs and skips, leaving InventoryClient.pendingRebalances partial (HL contribution absent) for that update cycle. The next cycle retries.

For a sustained HL outage, the relayer will still attempt to initiate new HL swap rebalances since rebalanceInventory doesn't gate on read-side health. That's intentionally out of scope here; will likely follow up with read-side health gating on HL rebalance initiation in a separate PR.

Ops note

Silent-by-default is by design here (relayer-up beats accounting-perfect), but operators should add a Datadog alert on the new BaseRebalancerClient#getPendingRebalances error log so a persistent adapter outage doesn't quietly degrade inventory accounting.

Test plan

  • tsc --noEmit clean
  • eslint + prettier clean on touched files
  • hardhat test test/RebalancerClient.cumulativeRebalancing.ts - 17/17 passing
  • New unit test "Isolates per-adapter failures when aggregating pending rebalances" constructs one healthy + one throwing adapter and asserts the healthy adapter's pending rebalances still come through and aggregation does not throw
  • Operator: confirm new error log surfaces in Datadog under the expected at: value and wire an alert

🤖 Generated with Claude Code

A transient outage in one adapter's read API (e.g. a 502 from
Hyperliquid's InfoClient when fetching non-funding ledger updates) used
to propagate up through BaseRebalancerClient.getPendingRebalances to
InventoryClient.update and crash the relayer process. Catch the error
per adapter, log it, and continue aggregating from the others so the
relayer keeps filling on chains served by healthy adapters and operates
on a partial view of pending rebalances for that cycle.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: f43cf50182

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +62 to +64
// Isolate per-adapter failures: a transient outage in one venue's read API (e.g. a 502
// from Hyperliquid) must not propagate up to callers like InventoryClient.update and take
// the relayer down. The caller operates on a partial view for this cycle.
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Document partial pending-rebalance reads

This changes getPendingRebalances from fail-fast aggregation to returning a partial view when any adapter read fails, but the rebalancer README still describes the client as aggregating the account's pending state without noting this failure mode. The repo's AGENTS.md requires relevant README/AGENTS updates whenever behavior or runtime flow changes; without that update, operators and consumers of ReadOnlyRebalancerClient can assume the returned accounting is complete during an adapter outage when it is intentionally incomplete.

Useful? React with 👍 / 👎.

…nces

Addresses Codex review on #3437: AGENTS.md requires README updates when
runtime behavior changes, and the new per-adapter failure isolation in
BaseRebalancerClient.getPendingRebalances now returns a partial view
during an adapter outage. Note the contract for ReadOnlyRebalancerClient
consumers (InventoryClient, Monitor) and clarify that direct adapter
callers (Binance finalizer) are unaffected.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@droplet-rl
Copy link
Copy Markdown
Contributor Author

@codex thanks — fair point. The repo's AGENTS.md (line 15) explicitly requires README.md/AGENTS.md updates when runtime behavior changes, and the new partial-aggregation semantics weren't reflected in src/rebalancer/README.md.

Pushed a0d60ba adding a "Partial-view semantics on adapter failure" subsection under Read-only mode: ReadOnlyRebalancerClient. It records:

  • BaseRebalancerClient.getPendingRebalances now isolates per-adapter failures, logs at error with at: "BaseRebalancerClient#getPendingRebalances" + adapter name, and returns a partial aggregate from the remaining adapters.
  • ReadOnlyRebalancerClient consumers (InventoryClient.update, Monitor) see an intentionally partial view during an adapter outage and operators should alert on the new error log so persistent outages don't quietly degrade inventory accounting.
  • Direct adapter callers that bypass BaseRebalancerClient (notably the Binance finalizer in src/finalizer/utils/binance.ts, which calls the Binance adapter directly) are unaffected — their errors still propagate, so the existing "errors surface normally" contract for that path is preserved.

Please re-review.

@chatgpt-codex-connector
Copy link
Copy Markdown

Codex Review: Didn't find any major issues. Delightful!

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

@droplet-rl
Copy link
Copy Markdown
Contributor Author

Acknowledged — Codex re-review on a0d60ba came back clean. No further changes.

message: `Failed to fetch pending rebalances from ${adapterName} adapter; skipping for this cycle`,
error: (error as Error)?.toString() ?? String(error),
});
return;
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it correct to return undefined here? Should we instead be returning 0 or something, in case the value is dereferenced?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good question — it's safe: the return; exits only the per-iteration forEachAsync callback, not getPendingRebalances itself.

  • forEachAsync in the SDK is await mapAsync(array, callback) which is Promise.all(array.map(callback)) — the callback's return value is discarded, so returning early just skips aggregation for that adapter.
  • The outer getPendingRebalances always returns pendingRebalances, which is initialized to {} on entry and only ever gets keys added via pendingRebalances[chainId][token] = ... for adapters whose read succeeded. It is never undefined, and for the skipped adapter the corresponding chainId/token entry is simply absent (rather than zero) — i.e. a partial map.
  • Consumers handle that shape naturally:
    • InventoryClient (src/clients/InventoryClient.ts:307-317) reads via pendingRebalances[chainId]?.[l1TokenSymbol] with isDefined guards, so a missing chain/token entry contributes 0 to the balance computation — exactly the same effect a literal bnZero would have, without forcing us to enumerate which chain/token pairs the failed adapter would have reported.
    • Monitor similarly indexes via pendingRebalances[chainId]?.[token] with optional chaining.

So from a caller's perspective the choice is between "missing key" and "key present with value 0" — both are dereferenced safely, and "missing key" has the advantage that we don't have to guess the set of chain/token pairs the failed adapter would have populated (it differs per adapter and per route config). Happy to switch to explicit zero-fill if you'd rather have the dictionary shape be uniform across success/failure cases, but the current form keeps the failure semantics minimal — "this adapter contributed nothing" rather than "this adapter contributed zero to chains X, Y, Z."

@droplet-rl
Copy link
Copy Markdown
Contributor Author

Thanks for the review @pxrl — replied to your inline question on the return;. TL;DR: it exits only the per-iteration forEachAsync callback (the SDK's forEachAsync discards callback return values), so the outer getPendingRebalances always returns the initialized pendingRebalances map. Skipped adapters leave their chain/token keys absent rather than zero, and existing consumers (InventoryClient, Monitor) already index with isDefined/optional chaining so a missing entry contributes 0. Happy to switch to explicit zero-fill if you'd prefer uniform shape — flagged that as the alternative in the inline reply.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants