
Release v3.8.2: reward overflow, committee-fetch panic & stability fixes#272

Merged
Zyra-V21 merged 19 commits into master from dev
Jun 12, 2026

Zyra-V21 commented May 26, 2026
Collaborator

Summary

Promotes the current dev line to master as v3.8.2. Six changes across arithmetic correctness, indexer stability, and crash resilience.

Note: #268 (f_activation_eligibility_epoch) was reverted on dev and is not part of this release. It will land in a future PR (#274) together with the rest of the deposit-lifecycle work.

Bug fixes

Behaviour

  • fillToHead loops until caught up: historical backfill now repeats until the live head gap is within one epoch before handing off to the head-following routine, honoring s.stop and adding handoff headroom. (Fix fillToHead deadlock #264, thanks @BitWonka)

Closes

Known issues (not addressed here)

Deploy notes

  • Ships one DB migration, applied by golang-migrate:
    • 000036_alter_block_rewards_uint256: MODIFY COLUMN on the three reward columns of t_block_rewards (UInt64 → UInt256). This triggers a background ClickHouse mutation that rewrites those columns across all parts; on large tables it needs free disk space and takes time to complete. Monitor system.mutations for completion.
  • val-window deployments: deletes now run every 32 finalized checkpoints by default. If you relied on per-checkpoint deletes, set --delete-cadence-epochs 1 (env DELETE_CADENCE_EPOCHS).

Test plan

  • Migration 000036 applies cleanly on a master-versioned DB
  • system.mutations for the t_block_rewards rewrite completes without error
  • Indexer keeps chain pace post-deploy (finalized epoch advancing)
  • Block reward columns hold large values without wraparound
  • Historical backfill hands off to head-following only when within one epoch of head
  • Indexer survives a transient committees-fetch failure (retries, no panic)

BitWonka and others added 14 commits April 16, 2026 03:11
…mutation pressure

The val-window service emits a lightweight DELETE on
t_validator_rewards_summary every finalized checkpoint event (~6:24 min).
On networks with ~1M validators this lowers to a ClickHouse mutation that
rewrites the in-window parts (14-55 GiB each on hoodi); each fire takes
minutes and queues up faster than it can drain, saturating one merge core
and stalling the head until an operator restart.

Make the boundary advance gate the fire: only emit DELETE once the
window's lower boundary has advanced by DELETE_CADENCE_EPOCHS since the
last successful fire (default 32 epochs ~3.4h, ~70x headroom over the
~150s mutation cost on hoodi). The first event after start always fires
to anchor the baseline; subsequent events skip until the cadence is met.

The DELETE statement and boundary calculation are unchanged - the only
observable difference is up to (cadence-1) extra epochs retained beyond
the strict window (0.16% overshoot vs. the 20250-epoch window). The
per-epoch surgical delete used by reorg recovery (DeleteStateMetrics) is
untouched. Set DELETE_CADENCE_EPOCHS=1 for legacy behaviour.
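The cadence gate described in this commit message can be sketched roughly as below. This is an illustration only, not the code from the commit: the `cadenceGate` type and its method names are hypothetical, and only the behaviour (first event always fires, later events fire once the boundary has advanced by the cadence) is taken from the text above.

```go
package main

// cadenceGate decides whether the windowed DELETE should fire for a given
// lower boundary epoch. Type and field names are hypothetical.
type cadenceGate struct {
	cadence   uint64 // value of DELETE_CADENCE_EPOCHS; 1 gives per-checkpoint deletes
	lastFired uint64 // lower boundary at the last successful fire
	anchored  bool   // false until the first fire establishes the baseline
}

// shouldFire reports whether the boundary has advanced by at least
// `cadence` epochs since the last successful fire.
func (g *cadenceGate) shouldFire(boundary uint64) bool {
	if !g.anchored {
		return true // the first event after start always fires
	}
	if boundary < g.lastFired {
		return false // guard against unsigned underflow on a stale boundary
	}
	return boundary-g.lastFired >= g.cadence
}

// markFired records a successful DELETE so later events measure from it.
func (g *cadenceGate) markFired(boundary uint64) {
	g.lastFired = boundary
	g.anchored = true
}
```

Calling `markFired` only after the DELETE succeeds means a failed statement is retried on the next finalized checkpoint instead of waiting for another full cadence.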
…UpTo race

Each FinalizedCheckpointEvent in head mode launches a new
`go AdvanceFinalized(...)`. When a previous invocation is still running
(common when ProcessStateTransitionMetrics takes longer than the ~6:24
min finalized interval — networks with ~1M validators, or any catch-up
scenario), two goroutines race over the same StateHistory: the newer one
runs CleanUpTo at the end of its loop and evicts entries that the older
one is still blocked on inside StateHistory.Wait / BlockHistory.Wait.
The blocked goroutine then waits forever holding a processerBook slot,
and successive races leak the whole 32-slot pool, surfacing as floods
of "Waiting for too long to acquire page" warnings and a stuck head.

Observed on goteth-hoodi this morning: a single dependency state at
epoch 93105 was evicted while a ProcessStateTransitionMetrics goroutine
held a Wait on it, blocking that slot for 30+ minutes; the analyzer
stopped advancing past dbHeadEpoch 93110 even with ClickHouse healthy.

Skip overlapping invocations via TryLock. The skipped one would have
iterated a subset of the state keys the next invocation will see, and
its CleanUpTo would have been a subset of what the next one performs,
so dropping it is monotonically safe — no work is lost.

The historical-mode synchronous call site (routines.go:208) is
unaffected: head mode only starts after historical completes, so
TryLock always succeeds there.
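A minimal sketch of the skip-on-overlap pattern, assuming a plain `sync.Mutex` guards the finalized-checkpoint path. The `finalizer` type and the `work` callback are hypothetical stand-ins for the real analyzer and `AdvanceFinalized` body; `sync.Mutex.TryLock` itself is standard library (Go 1.18+).

```go
package main

import "sync"

// finalizer serializes finalized-checkpoint processing. Hypothetical type.
type finalizer struct {
	mu   sync.Mutex
	runs int // invocations that actually ran; written only while mu is held
}

// advanceFinalized runs work unless a previous invocation still holds the
// lock. It returns false when the call was skipped.
func (f *finalizer) advanceFinalized(work func()) bool {
	if !f.mu.TryLock() {
		// An older invocation is still iterating; the next event will
		// cover the same keys, so this one can be dropped.
		return false
	}
	defer f.mu.Unlock()
	f.runs++
	work()
	return true
}
```

Unlike `Lock`, `TryLock` never queues the caller, so slow checkpoints cannot pile up goroutines behind the mutex.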
…tatus

Validators between deposit and activation can be in one of two
spec-defined sub-states: pending_initialized (eligibility epoch is
FAR_FUTURE_EPOCH) or pending_queued (eligibility epoch is set).
goteth read ActivationEligibilityEpoch from the beacon state into
local memory but never persisted it, so downstream consumers could
not split the two sub-states.

This commit:
- adds f_activation_eligibility_epoch (UInt64, default
  FAR_FUTURE_EPOCH) to t_validator_last_status via migration 000036
- extends the ValidatorLastStatus struct, ToArray, and the
  ClickHouse INSERT to carry the new field
- reads validator.ActivationEligibilityEpoch in
  processValLastStatus
- adds three invariant tests in tests/db_validator_test.py
- documents the column in docs/tables.md

Fixes #266
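For downstream consumers, the split the new column enables could look roughly like this. The sketch assumes the caller has already established that the validator is pending (deposited but not yet activated); `pendingSubState` is a hypothetical helper, not part of goteth. `FAR_FUTURE_EPOCH` is 2^64 - 1 in the consensus spec.

```go
package main

// farFutureEpoch mirrors FAR_FUTURE_EPOCH from the consensus spec (2^64 - 1).
const farFutureEpoch = ^uint64(0)

// pendingSubState classifies a validator that is already known to be
// pending. Hypothetical helper for illustration.
func pendingSubState(activationEligibilityEpoch uint64) string {
	if activationEligibilityEpoch == farFutureEpoch {
		return "pending_initialized" // not yet eligible for the activation queue
	}
	return "pending_queued" // eligibility epoch has been set
}
```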
* fix: block rewards overflow

* use helper for bigInt conversion

* use uint256 instead of string

* update docs

---------


Co-authored-by: Zyra-V21 <zyrav21@proton.me>
PR #259 (block-rewards overflow fix) merged into dev today and claimed
migration number 036 for alter_block_rewards_uint256. Renumber this PR's
migration to 037 to avoid the collision and keep numerical ordering
deterministic on rebase.

No content change; pure file rename.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…letes

fix: prevent goteth stalls on networks with large validator sets
@Zyra-V21 Zyra-V21 self-assigned this May 26, 2026
Zyra-V21 and others added 4 commits June 1, 2026 11:13
Two issues addressed on top of the existing outer-loop change:

1. Shutdown busy-loop. The outer `for` loop did not check `s.stop`
   between iterations. `runHistorical` returns immediately on
   `s.stop`, but since the chain keeps advancing the new outer loop
   re-queried `RequestCurrentHead` and called `runHistorical` again,
   producing a tight CPU-bound spin on shutdown. Add an explicit
   `if s.stop { return headSlot }` guard right after `runHistorical`.

2. Handoff threshold sits exactly on the pool capacity. The previous
   threshold was `SlotsPerEpoch` (32 slots), which is also the size
   of `processerBook` (`utils.NewRoutineBook(32, ...)` in
   chain_analyzer.go). Returning with a 32-slot gap lets `runHead`'s
   first enqueue burst fill every page in the pool; if any of those
   slots hit a cross-epoch `BlockHistory.Wait` dependency, the pool
   deadlocks — the failure mode this loop was added to avoid in the
   first place. Drop the threshold to `SlotsPerEpoch / 2` so there
   is room for the cross-epoch dependencies to land without the
   first dispatch burst sitting on the edge of the pool.

The threshold change adds at most one or two extra iterations near
the end of catch-up (each iteration is bounded by `runHistorical`
draining its slot range) and removes the only path that can leave
`runHead` starting in an immediately-saturated state.
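The two changes above can be sketched together as follows. This is a simplified model under stated assumptions, not the actual `fillToHead`: the `backfiller` type, its function-valued fields, and the exact comparison against the threshold are hypothetical; only the stop guard after `runHistorical` and the `SlotsPerEpoch / 2` threshold come from the commit message.

```go
package main

const slotsPerEpoch = 32

// backfiller stands in for the service. The function fields are
// hypothetical hooks replacing RequestCurrentHead and runHistorical.
type backfiller struct {
	stop          bool
	currentHead   func() uint64
	runHistorical func(from, to uint64) // processes slots in [from, to)
}

// fillToHead repeats the historical backfill until the gap to the live head
// is at most half an epoch, then returns the slot head-following starts from.
func (b *backfiller) fillToHead(start uint64) uint64 {
	const handoffThreshold = slotsPerEpoch / 2
	next := start
	for {
		headSlot := b.currentHead()
		if headSlot <= next || headSlot-next <= handoffThreshold {
			return next // close enough: hand off with pool headroom to spare
		}
		b.runHistorical(next, headSlot)
		if b.stop {
			return headSlot // shutdown: do not re-query the head and spin
		}
		next = headSlot
	}
}
```

Because the chain keeps advancing while `runHistorical` drains its range, each pass leaves a smaller gap than the previous one, which is why only a few extra iterations are expected near the end of catch-up.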
…ility-epoch"

This reverts commit 6ee1aec, reversing
changes made to ae219e8.
@leobago

leobago commented Jun 9, 2026

Member

@Zyra-V21 the description of this PR should be corrected, it does not include Fix #268

@Zyra-V21

Zyra-V21 commented Jun 9, 2026

Collaborator Author

> @Zyra-V21 the description of this PR should be corrected, it does not include Fix #268

Sure! I will restructure the PR description and title.

@Zyra-V21 Zyra-V21 changed the title from "Release v3.8.2: activation eligibility, reward overflow & stability fixes" to "Release v3.8.2: reward overflow, committee-fetch panic & stability fixes" on Jun 11, 2026
@Zyra-V21

Collaborator Author

@leobago captain we ready to ship

@leobago leobago left a comment

Member


LGTM.

Just a small detail, please fix this pre-existing typo:

BidCommision (one s) on the struct field. Not introduced here, but still in this diff.

Pre-existing misspelling (BidCommision, one 's') on the struct field
and local variables, flagged in the v3.8.2 release review. The DB
column f_bid_commission was already spelled correctly; this is a
pure Go-side rename with no schema or behaviour change.
@Zyra-V21

Collaborator Author

Good catch @leobago, fixed in fa95af8. Renamed the struct field and the locals (BidCommision → BidCommission); the f_bid_commission column was already spelled correctly, so it's a pure Go-side rename with no schema impact.

@leobago leobago left a comment

Member


LGTM!

@Zyra-V21 Zyra-V21 merged commit 563d3fe into master Jun 12, 2026

4 participants