Describe the issue
Description
The exponential backoff mechanism for retrying uncomputable handles caps the retry counter at 32,000. Because the counter doubles on each failure, it saturates after roughly 15 failures (2^15 = 32,768 > 32,000), so every persistently failing item ends up with the same counter value and the system loses any priority distinction between them.
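A minimal Rust sketch of the doubling-and-cap arithmetic, assuming the counter starts at 1 and doubles on every failed attempt, as the UPDATE statement quoted under Code Reference below implies:

```rust
/// Sketch of the capped doubling in update_uncomputable_handles(),
/// mirroring LEAST(uncomputable_counter * 2, 32000)::SMALLINT.
fn next_counter(counter: i16) -> i16 {
    // Widen to i32 before doubling so the cap is applied without overflow.
    ((counter as i32) * 2).min(32_000) as i16
}

fn main() {
    let mut counter: i16 = 1; // assumed starting value
    for failure in 1..=20 {
        counter = next_counter(counter);
        println!("failure {failure:2}: counter = {counter}");
    }
    // The counter saturates at 32000 around failure 15; every later
    // failure (and every other long-failing handle) reports the same value.
}
```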
Location
- File: coprocessor/fhevm-engine/tfhe-worker/src/tfhe_worker.rs
- Function: update_uncomputable_handles()
Code Reference
```sql
UPDATE computations
SET schedule_order = CURRENT_TIMESTAMP + INTERVAL '1 second' * uncomputable_counter,
    uncomputable_counter = LEAST(uncomputable_counter * 2, 32000)::SMALLINT
```
Impact
- The system cannot differentiate between an item that has been failing for 15 cycles and one that has been failing for 100 cycles
- Recently failed items (which are more likely to succeed on retry) are not prioritized over long-standing failures
- Performance degrades as the worker wastes cycles on items unlikely to succeed
Why It Matters
Effective backoff and retry strategies are crucial to the efficiency and resilience of distributed computation systems. A flawed strategy wastes resources and slows overall processing.
Suggested Fix
Implement a more robust backoff strategy:
- Increase the cap: raise it to a larger value, or remove it if the data type allows (SMALLINT tops out at 32,767, so a substantially larger cap requires widening the column)
- Add jitter: introduce randomness so saturated items do not retry in lockstep (thundering herd); see the sketch after the query below
- Incorporate time: record the timestamp of the last failure for better prioritization
```sql
UPDATE computations
SET schedule_order = ...,
    uncomputable_counter = LEAST(uncomputable_counter * 2, 32767)::SMALLINT, -- 32767 is the SMALLINT maximum
    last_failed_at = CURRENT_TIMESTAMP
```
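For the jitter point, a minimal Rust sketch of how a jittered delay could be computed; the function name, the starting assumptions, and the use of the rand crate are illustrative, not code from the repository:

```rust
use rand::Rng; // rand crate, assumed available

/// Hypothetical helper: next retry delay in seconds, with the higher cap
/// and +/-25% jitter so saturated items do not retry in lockstep.
fn next_retry_delay_secs(counter: i32) -> i64 {
    let capped = (counter * 2).min(32_767) as f64;
    // A factor in [0.75, 1.25) spreads out retries that would otherwise
    // share an identical schedule_order.
    let jitter = rand::thread_rng().gen_range(0.75..1.25);
    (capped * jitter) as i64
}

fn main() {
    // Two long-failing items with the same counter now get distinct delays.
    println!("{} vs {}", next_retry_delay_secs(32_000), next_retry_delay_secs(32_000));
}
```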
Reproduction Steps
- Create a computation handle designed to always fail
- Allow the worker to attempt processing it more than 15 times
- Observe uncomputable_counter capped at 32,000
- Create a second handle that fails for the first time
- After a few cycles, it also reaches 32,000
- Both items now have the same retry priority despite different failure histories (see the sketch below)
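A minimal Rust sketch of the convergence these steps describe, again assuming the counter starts at 1 and doubles per failed cycle:

```rust
/// Two handles with very different failure histories end up with
/// identical counters under LEAST(counter * 2, 32000).
fn capped_double(c: i32) -> i32 {
    (c * 2).min(32_000)
}

fn main() {
    let mut old_failure = 1; // has been failing for 100 cycles
    let mut new_failure = 1; // just started failing
    for _ in 0..100 { old_failure = capped_double(old_failure); }
    for _ in 0..16 { new_failure = capped_double(new_failure); }
    // Both print 32000: the scheduler can no longer tell them apart.
    println!("old: {old_failure}, new: {new_failure}");
}
```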