…mutation pressure

The val-window service emits a lightweight DELETE on t_validator_rewards_summary on every finalized checkpoint event (~6:24 min). On networks with ~1M validators this lowers to a ClickHouse mutation that rewrites the in-window parts (14-55 GiB each on hoodi); each fire takes minutes, and fires queue up faster than they can drain, saturating one merge core and stalling the head until an operator restart.

Gate the fire on boundary advance: only emit DELETE once the window's lower boundary has advanced by DELETE_CADENCE_EPOCHS since the last successful fire (default 32 epochs ~3.4h, ~70x headroom over the ~150s mutation cost on hoodi). The first event after start always fires to anchor the baseline; subsequent events skip until the cadence is met.

The DELETE statement and boundary calculation are unchanged - the only observable difference is up to (cadence-1) extra epochs retained beyond the strict window (0.16% overshoot vs. the 20250-epoch window). The per-epoch surgical delete used by reorg recovery (DeleteStateMetrics) is untouched.

Set DELETE_CADENCE_EPOCHS=1 for legacy behaviour.
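The gating rule described in this commit can be sketched roughly as follows. This is a minimal illustration, not the code from the commit: `CadenceGate`, `ShouldFire`, and `MarkFired` are hypothetical names, and treating a cadence of 0 as 1 is an assumption.

```go
package main

import "fmt"

// CadenceGate is a hypothetical helper that decides whether the window
// DELETE should be emitted for a given lower boundary (in epochs).
type CadenceGate struct {
	cadence      uint64 // stand-in for DELETE_CADENCE_EPOCHS
	lastBoundary uint64 // lower boundary at the last successful fire
	anchored     bool   // false until the first successful fire
}

// NewCadenceGate builds a gate; a cadence of 0 is treated as 1 (assumption).
func NewCadenceGate(cadence uint64) *CadenceGate {
	if cadence == 0 {
		cadence = 1
	}
	return &CadenceGate{cadence: cadence}
}

// ShouldFire reports whether a DELETE should run for this boundary.
// The first call after start always fires to anchor the baseline.
func (g *CadenceGate) ShouldFire(boundary uint64) bool {
	if !g.anchored {
		return true
	}
	if boundary < g.lastBoundary {
		return false // boundary did not advance; nothing to do
	}
	return boundary-g.lastBoundary >= g.cadence
}

// MarkFired records the boundary of a successful DELETE.
func (g *CadenceGate) MarkFired(boundary uint64) {
	g.lastBoundary = boundary
	g.anchored = true
}

func main() {
	gate := NewCadenceGate(32)
	for _, b := range []uint64{1000, 1001, 1031, 1032, 1033, 1064} {
		fire := gate.ShouldFire(b)
		if fire {
			gate.MarkFired(b) // in real code, only after the DELETE succeeds
		}
		fmt.Printf("boundary=%d fire=%v\n", b, fire)
	}
}
```

With a cadence of 32, this run fires at boundaries 1000, 1032, and 1064 and skips the rest. Recording the boundary only after a successful fire means a failed DELETE is retried on the next event.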
…UpTo race

Each FinalizedCheckpointEvent in head mode launches a new `go AdvanceFinalized(...)`. When a previous invocation is still running (common when ProcessStateTransitionMetrics takes longer than the ~6:24 min finalized interval: networks with ~1M validators, or any catch-up scenario), two goroutines race over the same StateHistory: the newer one runs CleanUpTo at the end of its loop and evicts entries that the older one is still blocked on inside StateHistory.Wait / BlockHistory.Wait. The blocked goroutine then waits forever holding a processerBook slot, and successive races leak the whole 32-slot pool, surfacing as floods of "Waiting for too long to acquire page" warnings and a stuck head.

Observed on goteth-hoodi this morning: a single dependency state at epoch 93105 was evicted while a ProcessStateTransitionMetrics goroutine held a Wait on it, blocking that slot for 30+ minutes; the analyzer stopped advancing past dbHeadEpoch 93110 even with ClickHouse healthy.

Skip overlapping invocations via TryLock. The skipped one would have iterated a subset of the state keys the next invocation will see, and its CleanUpTo would have been a subset of what the next one performs, so dropping it is monotonically safe: no work is lost.

The historical-mode synchronous call site (routines.go:208) is unaffected: head mode only starts after historical completes, so TryLock always succeeds there.
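The skip-on-overlap pattern can be illustrated with `sync.Mutex.TryLock` (available since Go 1.18). The sketch below is an assumption about shape only: the `Analyzer` type, its `advanceMu` field, and the callback-style signature are hypothetical and do not mirror goteth's actual `AdvanceFinalized`.

```go
package main

import (
	"fmt"
	"sync"
)

// Analyzer is a hypothetical stand-in for the chain analyzer.
type Analyzer struct {
	advanceMu sync.Mutex // guards against overlapping finalized advances
}

// AdvanceFinalized runs work unless an earlier invocation is still in
// progress, in which case it returns false without doing anything.
func (a *Analyzer) AdvanceFinalized(work func()) bool {
	if !a.advanceMu.TryLock() {
		return false // a previous invocation is still running: skip
	}
	defer a.advanceMu.Unlock()
	work()
	return true
}

func main() {
	a := &Analyzer{}
	started := make(chan struct{})
	release := make(chan struct{})
	done := make(chan bool)

	// First invocation blocks inside its work until released.
	go func() {
		done <- a.AdvanceFinalized(func() {
			close(started)
			<-release
		})
	}()
	<-started

	// A second event arrives while the first is still running.
	skipped := !a.AdvanceFinalized(func() {})
	close(release)

	fmt.Println("overlapping call skipped:", skipped)
	fmt.Println("first call ran:", <-done)
	fmt.Println("later call ran:", a.AdvanceFinalized(func() {}))
}
```

Because the skipped call returns immediately, it never reaches the cleanup step, so it cannot evict entries the running invocation still waits on.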
…tatus

Validators between deposit and activation can be in one of two spec-defined sub-states: pending_initialized (eligibility epoch is FAR_FUTURE_EPOCH) or pending_queued (eligibility epoch is set). goteth read ActivationEligibilityEpoch from the beacon state into local memory but never persisted it, so downstream consumers could not split the two sub-states.

This commit:
- adds f_activation_eligibility_epoch (UInt64, default FAR_FUTURE_EPOCH) to t_validator_last_status via migration 000036
- extends the ValidatorLastStatus struct, ToArray, and the ClickHouse INSERT to carry the new field
- reads validator.ActivationEligibilityEpoch in processValLastStatus
- adds three invariant tests in tests/db_validator_test.py
- documents the column in docs/tables.md

Fixes #266
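As an illustration of what the persisted column makes possible, a downstream consumer could split the two pending sub-states along these lines. `PendingSubStatus` is a hypothetical helper, not part of goteth, and it assumes the caller has already established that the validator sits between deposit and activation.

```go
package main

import (
	"fmt"
	"math"
)

// FarFutureEpoch mirrors FAR_FUTURE_EPOCH from the consensus spec (2^64 - 1).
const FarFutureEpoch uint64 = math.MaxUint64

// PendingSubStatus classifies a not-yet-activated validator from its
// activation eligibility epoch. Hypothetical helper for illustration.
func PendingSubStatus(activationEligibilityEpoch uint64) string {
	if activationEligibilityEpoch == FarFutureEpoch {
		return "pending_initialized" // eligibility epoch not set yet
	}
	return "pending_queued" // eligibility epoch set, waiting in the queue
}

func main() {
	fmt.Println(PendingSubStatus(FarFutureEpoch))
	fmt.Println(PendingSubStatus(93105))
}
```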
Fix advance finalized redownload
* fix: block rewards overflow
* use helper for bigInt conversion
* use uint256 instead of string
* update docs

---------

Co-authored-by: Zyra-V21 <zyrav21@proton.me>
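The class of bug being fixed can be shown with a small sketch: fixed-width `uint64` arithmetic wraps silently, while `math/big` does not. The helper names are hypothetical, and using a sum as the overflowing operation is an assumption made for illustration; the commit does not state which operation overflowed.

```go
package main

import (
	"fmt"
	"math"
	"math/big"
)

// AddWrapping adds two values in uint64; the result wraps modulo 2^64.
func AddWrapping(a, b uint64) uint64 {
	return a + b
}

// AddWide adds the same values as arbitrary-precision integers.
func AddWide(a, b uint64) *big.Int {
	x := new(big.Int).SetUint64(a)
	y := new(big.Int).SetUint64(b)
	return x.Add(x, y)
}

func main() {
	a, b := uint64(math.MaxUint64), uint64(1)
	fmt.Println("uint64 :", AddWrapping(a, b)) // wraps around to 0
	fmt.Println("big.Int:", AddWide(a, b))     // 2^64, no wrap
}
```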
PR #259 (block-rewards overflow fix) merged into dev today and claimed migration number 036 for alter_block_rewards_uint256. Renumber this PR's migration to 037 to avoid the collision and keep numerical ordering deterministic on rebase. No content change; pure file rename.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…letes fix: prevent goteth stalls on networks with large validator sets
fix base fee byte order
fix: persist activation_eligibility_epoch
Two issues addressed on top of the existing outer-loop change:
1. Shutdown busy-loop. The outer `for` loop did not check `s.stop`
between iterations. `runHistorical` returns immediately on
`s.stop`, but since the chain keeps advancing the new outer loop
re-queried `RequestCurrentHead` and called `runHistorical` again,
producing a tight CPU-bound spin on shutdown. Add an explicit
`if s.stop { return headSlot }` guard right after `runHistorical`.
2. Handoff threshold sits exactly on the pool capacity. The previous
threshold was `SlotsPerEpoch` (32 slots), which is also the size
of `processerBook` (`utils.NewRoutineBook(32, ...)` in
chain_analyzer.go). Returning with a 32-slot gap lets `runHead`'s
first enqueue burst fill every page in the pool; if any of those
slots hit a cross-epoch `BlockHistory.Wait` dependency, the pool
deadlocks — the failure mode this loop was added to avoid in the
first place. Drop the threshold to `SlotsPerEpoch / 2` so there
is room for the cross-epoch dependencies to land without the
first dispatch burst sitting on the edge of the pool.
The threshold change adds at most one or two extra iterations near
the end of catch-up (each iteration is bounded by `runHistorical`
draining its slot range) and removes the only path that can leave
`runHead` starting in an immediately-saturated state.
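The two changes above can be sketched as a small simulation. Everything here is illustrative: the `syncer` type, its method signatures, the simulated chain advance, and the exact comparison against the threshold (`<=`) are assumptions, not the code from this commit.

```go
package main

import "fmt"

const (
	slotsPerEpoch    = 32
	handoffThreshold = slotsPerEpoch / 2 // leaves headroom below the 32-page pool
)

// syncer is a hypothetical stand-in for the analyzer's sync state.
type syncer struct {
	stop bool
	head uint64 // simulated live chain head
}

func (s *syncer) requestCurrentHead() uint64 { return s.head }

// runHistorical simulates a backfill pass; the chain advances while it
// runs. On stop it returns immediately without making progress.
func (s *syncer) runHistorical(from, to uint64) uint64 {
	if s.stop {
		return from
	}
	s.head += (to - from) / 10 // assumed: 1 new slot per 10 processed
	return to
}

// fillToHead repeats historical passes until the gap to the live head is
// at most handoffThreshold, then returns the slot to hand off at.
func (s *syncer) fillToHead(start uint64) uint64 {
	headSlot := start
	for {
		target := s.requestCurrentHead()
		if target <= headSlot || target-headSlot <= handoffThreshold {
			return headSlot
		}
		headSlot = s.runHistorical(headSlot, target)
		if s.stop {
			return headSlot // without this guard the loop would spin on shutdown
		}
	}
}

func main() {
	s := &syncer{head: 1000}
	reached := s.fillToHead(0)
	fmt.Printf("handed off at slot %d, head %d, gap %d\n", reached, s.head, s.head-reached)

	stopped := &syncer{head: 1000, stop: true}
	fmt.Println("stopped run returned slot", stopped.fillToHead(0))
}
```

In this simulation the loop takes two backfill passes and hands off with a 10-slot gap; the stopped case returns after a single pass instead of re-querying the head forever.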
Fix fillToHead deadlock
@leobago captain we ready to ship
leobago approved these changes on Jun 11, 2026
LGTM.
Just a small detail, please fix this pre-existing typo: BidCommision (one s) on the struct field. Not introduced here, but still in this diff.
Pre-existing misspelling (BidCommision, one 's') on the struct field and local variables, flagged in the v3.8.2 release review. The DB column f_bid_commission was already spelled correctly; this is a pure Go-side rename with no schema or behaviour change.
Summary
Promotes the current `dev` line to `master` as v3.8.2. Six changes across arithmetic correctness, indexer stability, and crash resilience.

Bug fixes
- … `t_block_rewards` to `UInt256`/`*big.Int` so large values no longer wrap (migration `000036`). (Fix block rewards overflow #259, thanks @BitWonka)
- … `--max-request-retries` instead of caching a state with empty committees, and guard `GetAttestingIndices` (Electra) / `GetValidatorFromCommitteeIndex` (Altair/Deneb) against nil/missing committees so a transient beacon-node failure no longer SIGSEGVs the indexer. Reported by the Obol team in [bug] Panic in ElectraMetrics.GetAttestingIndices when GetBeaconCommittee returns nil #271 (thanks @anadi2311); we shipped the fix on branch `fix/issue-271-nil-committee-panic`, which they validated in production on their fork before it landed here (fix: prevent panic when beacon committee data is missing (#271) #273). Includes unit tests for the guards.
- … `AdvanceFinalized` to prevent a concurrent `CleanUpTo` race in steady-state head mode. (Fix advance finalized redownload #263, thanks @BitWonka)
- … `val-window` flag `--delete-cadence-epochs` (default 32; set to 1 for the legacy delete-every-checkpoint behaviour). (fix: prevent goteth stalls on networks with large validator sets #269)

Behaviour
- `fillToHead` loops until caught up: historical backfill now repeats until the live head gap is within one epoch before handing off to the head-following routine, honoring `s.stop` and adding handoff headroom. (Fix fillToHead deadlock #264, thanks @BitWonka)

Closes

Known issues (not addressed here)
- … (`HandleReorg` leaves stale rewards in `t_block_rewards` for reorged slots) is a separate reorg path from Fix advance finalized redownload #263 and remains open.

Deploy notes
- `000036_alter_block_rewards_uint256`: `MODIFY COLUMN` on the three reward columns of `t_block_rewards` (UInt64 → UInt256). This triggers a background ClickHouse mutation that rewrites those columns across all parts; on large tables it needs free disk and is not instantaneous. Monitor `system.mutations` for completion.
- `val-window` deployments: deletes now run every 32 finalized checkpoints by default. If you relied on per-checkpoint deletes, set `--delete-cadence-epochs 1` (env `DELETE_CADENCE_EPOCHS`).

Test plan
- `000036` applies cleanly on a master-versioned DB
- `system.mutations` for the `t_block_rewards` rewrite completes without error