fix(rebalancer): isolate per-adapter failures in getPendingRebalances#3437
fix(rebalancer): isolate per-adapter failures in getPendingRebalances#3437droplet-rl wants to merge 2 commits into
Conversation
A transient outage in one adapter's read API (e.g. a 502 from Hyperliquid's InfoClient when fetching non-funding ledger updates) used to propagate up through BaseRebalancerClient.getPendingRebalances to InventoryClient.update and crash the relayer process. Catch the error per adapter, log it, and continue aggregating from the others so the relayer keeps filling on chains served by healthy adapters and operates on a partial view of pending rebalances for that cycle. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: f43cf50182
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| // Isolate per-adapter failures: a transient outage in one venue's read API (e.g. a 502 | ||
| // from Hyperliquid) must not propagate up to callers like InventoryClient.update and take | ||
| // the relayer down. The caller operates on a partial view for this cycle. |
There was a problem hiding this comment.
Document partial pending-rebalance reads
This changes getPendingRebalances from fail-fast aggregation to returning a partial view when any adapter read fails, but the rebalancer README still describes the client as aggregating the account's pending state without noting this failure mode. The repo's AGENTS.md requires relevant README/AGENTS updates whenever behavior or runtime flow changes; without that update, operators and consumers of ReadOnlyRebalancerClient can assume the returned accounting is complete during an adapter outage when it is intentionally incomplete.
Useful? React with 👍 / 👎.
…nces Addresses Codex review on #3437: AGENTS.md requires README updates when runtime behavior changes, and the new per-adapter failure isolation in BaseRebalancerClient.getPendingRebalances now returns a partial view during an adapter outage. Note the contract for ReadOnlyRebalancerClient consumers (InventoryClient, Monitor) and clarify that direct adapter callers (Binance finalizer) are unaffected. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
|
@codex thanks — fair point. The repo's Pushed
Please re-review. |
|
Codex Review: Didn't find any major issues. Delightful! ℹ️ About Codex in GitHubYour team has set up Codex to review pull requests in this repo. Reviews are triggered when you
If Codex has suggestions, it will comment; otherwise it will react with 👍. Codex can also answer questions or update the PR. Try commenting "@codex address that feedback". |
|
Acknowledged — Codex re-review on |
| message: `Failed to fetch pending rebalances from ${adapterName} adapter; skipping for this cycle`, | ||
| error: (error as Error)?.toString() ?? String(error), | ||
| }); | ||
| return; |
There was a problem hiding this comment.
Is it correct to return undefined here? Should we instead be returning 0 or something, in case the value is dereferenced?
There was a problem hiding this comment.
Good question — it's safe: the return; exits only the per-iteration forEachAsync callback, not getPendingRebalances itself.
forEachAsyncin the SDK isawait mapAsync(array, callback)which isPromise.all(array.map(callback))— the callback's return value is discarded, so returning early just skips aggregation for that adapter.- The outer
getPendingRebalancesalways returnspendingRebalances, which is initialized to{}on entry and only ever gets keys added viapendingRebalances[chainId][token] = ...for adapters whose read succeeded. It is neverundefined, and for the skipped adapter the correspondingchainId/tokenentry is simply absent (rather than zero) — i.e. a partial map. - Consumers handle that shape naturally:
InventoryClient(src/clients/InventoryClient.ts:307-317) reads viapendingRebalances[chainId]?.[l1TokenSymbol]withisDefinedguards, so a missing chain/token entry contributes 0 to the balance computation — exactly the same effect a literalbnZerowould have, without forcing us to enumerate which chain/token pairs the failed adapter would have reported.Monitorsimilarly indexes viapendingRebalances[chainId]?.[token]with optional chaining.
So from a caller's perspective the choice is between "missing key" and "key present with value 0" — both are dereferenced safely, and "missing key" has the advantage that we don't have to guess the set of chain/token pairs the failed adapter would have populated (it differs per adapter and per route config). Happy to switch to explicit zero-fill if you'd rather have the dictionary shape be uniform across success/failure cases, but the current form keeps the failure semantics minimal — "this adapter contributed nothing" rather than "this adapter contributed zero to chains X, Y, Z."
|
Thanks for the review @pxrl — replied to your inline question on the |
Summary
A transient outage in one adapter's read API used to propagate up through
BaseRebalancerClient.getPendingRebalances->InventoryClient.updateand crash the relayer process. This change catches per-adapter errors, logs them aterrorlevel with the adapter name, and continues aggregating from the others so the relayer keeps filling on chains served by healthy adapters and operates on a partial view of pending rebalances for that cycle.Motivation
We just saw
zion-across-relayer-primaryexit with:The call chain bottoms out in
InventoryClient.update(), which has no catch around the rebalancer-client call. A transient HL read-API outage gates every fill the relayer would otherwise do on Ethereum, Arbitrum, Base, etc. That's the root problem: treating one venue's read API as a hard precondition for the whole relay loop.Change
BaseRebalancerClient.getPendingRebalancesnow wraps eachadapter.getPendingRebalances(account)in try/catch:error(at: "BaseRebalancerClient#getPendingRebalances", includes the adapter name and stringified error), skip that adapter for the cycle.The
forEachAsyncloop was switched fromObject.values(this.adapters)toObject.entries(this.adapters)so the adapter name is available for logging.Behavioral effect
The exact 502 from
HyperliquidStablecoinSwapAdapter.getPendingRebalancesthat crashed the bot now logs and skips, leavingInventoryClient.pendingRebalancespartial (HL contribution absent) for that update cycle. The next cycle retries.For a sustained HL outage, the relayer will still attempt to initiate new HL swap rebalances since
rebalanceInventorydoesn't gate on read-side health. That's intentionally out of scope here; will likely follow up with read-side health gating on HL rebalance initiation in a separate PR.Ops note
Silent-by-default is by design here (relayer-up beats accounting-perfect), but operators should add a Datadog alert on the new
BaseRebalancerClient#getPendingRebalanceserror log so a persistent adapter outage doesn't quietly degrade inventory accounting.Test plan
tsc --noEmitcleaneslint+prettierclean on touched fileshardhat test test/RebalancerClient.cumulativeRebalancing.ts- 17/17 passing"Isolates per-adapter failures when aggregating pending rebalances"constructs one healthy + one throwing adapter and asserts the healthy adapter's pending rebalances still come through and aggregation does not throwat:value and wire an alert🤖 Generated with Claude Code