
Exponential backoff cap at 32000 loses retry prioritization #1219

@evmparser

Description

The exponential backoff for retrying uncomputable handles doubles uncomputable_counter on each failure and caps it at 32,000. After about 15 consecutive failures (2^15 = 32,768, which exceeds the cap), every persistently failing item holds the same counter value, so the system can no longer tell these items apart when prioritizing retries.

Location

  • File: coprocessor/fhevm-engine/tfhe-worker/src/tfhe_worker.rs
  • Function: update_uncomputable_handles()

Code Reference

UPDATE computations
SET schedule_order = CURRENT_TIMESTAMP + INTERVAL '1 second' * uncomputable_counter,
uncomputable_counter = LEAST(uncomputable_counter * 2, 32000)::SMALLINT
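
To make the saturation point concrete, here is a minimal Rust sketch of the counter update in the statement above. It is only a model of the arithmetic, not the worker's code: the function names are hypothetical, and it assumes a fresh handle starts with a counter of 1, which the issue does not state.

```rust
/// Cap used by the UPDATE statement above.
const CAP: i32 = 32_000;

/// Models `LEAST(uncomputable_counter * 2, 32000)::SMALLINT`.
/// The doubling is done in i32 so it cannot overflow before the cap applies.
fn next_counter(counter: i16) -> i16 {
    let doubled = i32::from(counter) * 2;
    // 32_000 fits in i16 (max 32_767), so this cast never truncates.
    doubled.min(CAP) as i16
}

/// Number of consecutive failures until the counter first reaches the cap,
/// assuming (hypothetically) that a fresh handle starts at 1.
fn failures_until_cap() -> u32 {
    let mut counter: i16 = 1;
    let mut failures = 0;
    while i32::from(counter) < CAP {
        counter = next_counter(counter);
        failures += 1;
    }
    failures
}

/// Counter value after `n` consecutive failures, same starting assumption.
fn counter_after(n: u32) -> i16 {
    (0..n).fold(1, |c, _| next_counter(c))
}
```

Under that starting assumption, `failures_until_cap()` returns 15, and `counter_after(15)` and `counter_after(100)` both return 32,000.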

Impact

  • The system cannot differentiate between an item that has failed for 15 cycles and one that has failed for 100 cycles
  • Recently failed items (which are more likely to succeed) are not prioritized over long-standing failures
  • Performance degrades as the worker wastes cycles on items that are unlikely to succeed

Why It Matters

Effective backoff and retry strategies are crucial to the efficiency and resilience of distributed computation systems. A flawed strategy wastes resources and slows processing.

Suggested Fix

Implement a more robust backoff strategy:

  1. Increase the cap: Change it to a larger value, or remove it if the data type allows
  2. Add jitter: Introduce randomness to prevent thundering herds
  3. Incorporate time: Store a timestamp of the last failure for better prioritization
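
UPDATE computations
SET schedule_order = ...,
    -- SMALLINT tops out at 32,767, so a 65,535 cap needs the column widened to INTEGER first
    uncomputable_counter = LEAST(uncomputable_counter * 2, 65535)::INTEGER,
    -- last_failed_at is a proposed new column
    last_failed_at = CURRENT_TIMESTAMP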
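
The cap and jitter suggestions could be sketched on the Rust side roughly as follows. This is an illustration, not the worker's implementation: `MAX_COUNTER`, `next_counter_wide`, and `jittered_delay` are hypothetical names, the 65,535 cap assumes the counter column has been widened beyond SMALLINT, and the "equal jitter" variant (half fixed, half random) is one of several possible choices. The random value is passed in by the caller so the sketch stays free of external crates.

```rust
use std::time::Duration;

/// Hypothetical wider cap; it needs a counter column wider than SMALLINT.
const MAX_COUNTER: u32 = 65_535;

/// Doubles the counter and saturates at `MAX_COUNTER`.
fn next_counter_wide(counter: u32) -> u32 {
    counter.saturating_mul(2).min(MAX_COUNTER)
}

/// "Equal jitter": half of the backoff is fixed and half is random, so the
/// delay lies in [counter / 2, counter] seconds and longer-failing items
/// still tend to wait longer.
///
/// `unit_random` is expected to lie in [0.0, 1.0) and is supplied by the
/// caller, for example from a random number generator.
fn jittered_delay(counter: u32, unit_random: f64) -> Duration {
    // Guard against NaN or out-of-range input so the Duration stays valid.
    let r = if unit_random.is_finite() {
        unit_random.clamp(0.0, 1.0)
    } else {
        0.0
    };
    let half = f64::from(counter) / 2.0;
    Duration::from_secs_f64(half + half * r)
}
```

Because the random part covers only half of the delay, two items with different counters keep their relative order on average, while items with the same counter no longer wake up at exactly the same moment.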

Reproduction Steps

  1. Create a computation handle designed to always fail
  2. Allow the worker to attempt processing more than 15 times
  3. Observe that uncomputable_counter is capped at 32,000
  4. Create a second handle that fails for the first time
  5. After about 15 further failures, it also reaches 32,000
  6. Both items now have the same retry priority despite different failure histories
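
The end state of these steps can be modelled in memory without a database. The sketch below is hypothetical: the `Handle` struct and its `failures` field are illustrative only (only `uncomputable_counter` comes from the issue), and it again assumes a fresh handle starts with a counter of 1.

```rust
/// In-memory stand-in for a row in `computations`. Only
/// `uncomputable_counter` comes from the issue; `failures` is added here
/// purely to track the history that the counter loses.
#[derive(Debug, Clone, PartialEq)]
struct Handle {
    failures: u32,
    uncomputable_counter: i16,
}

impl Handle {
    /// Assumes a fresh handle starts with a counter of 1.
    fn new() -> Self {
        Handle { failures: 0, uncomputable_counter: 1 }
    }

    /// Applies the capped doubling from the UPDATE statement once.
    fn record_failure(&mut self) {
        let doubled = i32::from(self.uncomputable_counter) * 2;
        self.uncomputable_counter = doubled.min(32_000) as i16;
        self.failures += 1;
    }
}

/// Returns a handle that has failed `n` times in a row.
fn handle_after_failures(n: u32) -> Handle {
    let mut handle = Handle::new();
    for _ in 0..n {
        handle.record_failure();
    }
    handle
}
```

Comparing `handle_after_failures(100)` with `handle_after_failures(15)` gives two handles whose `uncomputable_counter` values are identical even though their failure counts differ, which is the state described in step 6.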
