Conversation

@drazisil-codecov (Contributor) commented Nov 11, 2025

Upload Task Resilience Improvements

Problem

Upload tasks could be lost for extended periods when worker pods died during retry processing.

Root causes:

  • retries == 0 check only verified processing status on first attempt
  • 6-hour visibility timeout delayed recovery from pod failures
  • No automatic task recovery when workers died
  • No metrics for stuck or retrying uploads

Solution

1. Fixed Retry Logic (apps/worker/tasks/upload.py)

  • Removed retries == 0 condition - processing status checked on every retry
  • Added max retry limit (10) to prevent infinite loops
  • Added retry count metrics for observability

Impact: Pod failures during retry no longer cause tasks to skip processing checks.
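
For illustration, here is a minimal sketch of the new check, assuming hypothetical helper and constant names (has_upload_processing_lock, MAX_PROCESSING_RETRIES); it is not a copy of the actual diff:

import logging

log = logging.getLogger(__name__)

MAX_PROCESSING_RETRIES = 10  # documented default in this PR


def has_upload_processing_lock(redis, repoid, commitid):
    # Illustrative only: the key format is an assumption, not the real one
    return bool(redis.exists(f"upload_processing_lock/{repoid}/{commitid}"))


def check_still_processing(task, redis, repoid, commitid):
    # Runs on EVERY attempt now, not only when task.request.retries == 0
    if has_upload_processing_lock(redis, repoid, commitid):
        if task.request.retries >= MAX_PROCESSING_RETRIES:
            log.warning(
                "Upload still processing after max retries, giving up",
                extra={"repoid": repoid, "commitid": commitid,
                       "retry_count": task.request.retries},
            )
            return {"reason": "too_many_processing_retries"}
        # Celery's retry() raises a Retry exception, re-scheduling the task
        task.retry(countdown=60, max_retries=MAX_PROCESSING_RETRIES)
    return None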

2. Faster Pod Failure Recovery (libs/shared/shared/celery_config.py)

  • Reduced visibility timeout: 6 hours → 15 minutes (configurable)
  • Enabled task_reject_on_worker_lost - tasks immediately re-queued on pod death

Impact: Tasks lost to pod crashes retry in 15 minutes instead of 6 hours.
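
In Celery terms, these two settings look roughly like the sketch below (values taken from this description, not a verbatim copy of the diff):

# Redis transport option: how long an unacknowledged task stays invisible
# before Redis re-delivers it to another worker.
broker_transport_options = {
    "visibility_timeout": 900,  # 15 minutes; previously 21600 (6 hours)
}

# task_reject_on_worker_lost only helps when acks are late, i.e. tasks are
# acknowledged after completion rather than on receipt.
task_acks_late = True
task_reject_on_worker_lost = True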

3. Configuration Centralization (libs/shared/shared/celery_config.py)

All resilience values now configurable via YAML or environment variables:

setup:
  tasks:
    celery:
      visibility_timeout: 900        # 15 min (default)
      max_retries: 10
      reject_on_worker_lost: true
      retry_backoff_base: 20         # Exponential backoff base
    upload:
      processing_max_retries: 10
      processing_retry_delay: 60

See inline code comments for full configuration options and docs links.
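
As a hedged example, task code could read the upload-specific values above via the shared config helper (the exact call sites are not shown in this description; import path assumed):

from shared.config import get_config

# Keys mirror the YAML above; defaults match the documented values.
processing_max_retries = get_config(
    "setup", "tasks", "upload", "processing_max_retries", default=10
)
processing_retry_delay = get_config(
    "setup", "tasks", "upload", "processing_retry_delay", default=60
)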

4. Enhanced Observability

New metrics for monitoring:

  • worker_task_counts_retries_by_count{task, retry_count} - Track retry patterns
  • upload_task_processing_retry{report_type, retry_count} - Processing lock retries
  • upload_task_too_many_retries{report_type} - Tasks hitting max retries
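
The counters behind these names could be declared roughly as follows with prometheus_client (label sets inferred from the metric names above; the actual declarations may differ):

from prometheus_client import Counter

UPLOAD_TASK_PROCESSING_RETRY_COUNTER = Counter(
    "upload_task_processing_retry",
    "Upload task retried because processing was still in progress",
    ["report_type", "retry_count"],
)
UPLOAD_TASK_TOO_MANY_RETRIES_COUNTER = Counter(
    "upload_task_too_many_retries",
    "Upload tasks that gave up after hitting the max retry limit",
    ["report_type"],
)

# Example increment when the processing lock is still held on a retry:
UPLOAD_TASK_PROCESSING_RETRY_COUNTER.labels(
    report_type="coverage", retry_count="3"
).inc()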

5. Generalized for All Tasks (apps/worker/tasks/base.py)

Enhanced BaseCodecovTask.safe_retry():

  • Configurable max retries with graceful failure
  • Exponential backoff using centralized config
  • Comprehensive metrics tracking
  • Proper error handling

All tasks can now use improved retry logic:

if not self.safe_retry(max_retries=5, countdown=60):
    # Max retries exceeded - handle gracefully
    return {"success": False, "reason": "max_retries"}

Safety

Race condition protection:

  • Existing Redis locks prevent concurrent processing
  • Atomic lpop operations prevent duplicate job consumption
  • Processing lock expires after ~12 minutes
  • Upload queues expire after 24 hours
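
For context, the atomic-lpop drain works roughly like the sketch below (the key format and function name are illustrative, not the real implementation):

import json


def drain_pending_uploads(redis, repoid, commitid):
    uploads_key = f"uploads/{repoid}/{commitid}"  # illustrative key format
    jobs = []
    while True:
        raw = redis.lpop(uploads_key)  # atomic: each job is popped by exactly one worker
        if raw is None:
            break
        jobs.append(json.loads(raw))
    return jobs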

Rollback:
If issues arise, revert changes to:

  • libs/shared/shared/celery_config.py (restores 6-hour visibility timeout)
  • apps/worker/tasks/upload.py (restores retries == 0 check)

Files Modified

Core changes:

  • libs/shared/shared/celery_config.py - Configuration centralization with inline docs
  • apps/worker/tasks/base.py - Enhanced BaseCodecovTask with safe_retry()
  • apps/worker/tasks/upload.py - Fixed retry logic (removed retries==0 check)

Next Steps

Future improvements (separate PRs):

  1. Dead Letter Queue (DLQ) - Add recovery mechanism for tasks that hit max retries, preventing data loss
  2. Alerting - Set up Prometheus alerts on new metrics:
    • Alert when upload_task_processing_retry{retry_count="5"} > threshold (stuck uploads)
    • Alert when upload_task_too_many_retries increases (tasks failing repeatedly)
    • Alert when worker_task_counts_max_retries_exceeded increases (widespread issues)

Backward Compatible: ✅ Yes
Database Changes: ❌ None
Breaking Changes: ❌ None
Configuration Required: Ensure task_acks_late: true is set in deployment config

**Problem:**
Upload task 15fa00a1-b967-4ab2-9199-543d7e41e89d experienced a 6-hour delay
after a pod failure: the retry was delivered to a worker pod that died before
processing it, so the task was lost until the visibility timeout expired.

**Root Causes:**
1. The `retries == 0` condition only checked for processing on first attempt
2. 6-hour visibility timeout meant lost tasks took 6 hours to recover
3. Tasks weren't automatically returned to queue when workers died
4. No metrics to monitor stuck or repeatedly retried uploads

**Changes:**

1. **Removed `retries == 0` condition** (apps/worker/tasks/upload.py)
   - Now EVERY retry checks if upload is currently processing
   - Added max retry limit of 10 to prevent infinite loops
   - Logs include retry count for debugging

2. **Reduced visibility timeout** (libs/shared/shared/celery_config.py)
   - Changed from 6 hours to 15 minutes
   - Lost tasks are now retried in 15 minutes instead of 6 hours

3. **Added task_reject_on_worker_lost** (libs/shared/shared/celery_config.py)
   - When pods die, tasks are immediately returned to queue
   - Another worker picks up the task within seconds

4. **Added monitoring metrics** (apps/worker/tasks/upload.py)
   - UPLOAD_TASK_PROCESSING_RETRY_COUNTER: tracks retry patterns
   - UPLOAD_TASK_TOO_MANY_RETRIES_COUNTER: tracks abandoned uploads

**Protection Against Double Processing:**
Multiple layers prevent race conditions:
- upload_processing_lock check detects if processor is running
- upload_lock (Redis lock) serializes Upload task execution
- has_pending_jobs() recheck inside lock verifies jobs still exist
- Atomic lpop ensures same job isn't processed twice

**Impact:**
- Pod failures now recover in seconds/minutes instead of 6 hours
- Tasks survive worker crashes without data loss
- Better observability via metrics
- Fully backward compatible

See UPLOAD_TASK_RESILIENCE_CHANGES.md for complete analysis and testing plan.
@drazisil-codecov changed the title from "Fix upload task resilience to pod failures" to "CCMRG-1863 Fix upload task resilience to pod failures" on Nov 11, 2025

linear bot commented Nov 11, 2025

@drazisil-codecov (Contributor, Author) commented:

Change 3 should also address CCMRG-1848

@drazisil-codecov (Contributor, Author) commented:

@sentry review


codspeed-hq bot commented Nov 11, 2025

CodSpeed Performance Report

Merging #557 will not alter performance

Comparing fix-upload-task-pod-failure-resilience (8cc2510) with main (5ba2af3)¹

Summary

✅ 9 untouched

Footnotes

¹ No successful run was found on main (21a85a5) during the generation of this report, so 5ba2af3 was used instead as the comparison base. There might be some changes unrelated to this pull request in this report.


sentry bot commented Nov 11, 2025

Codecov Report

❌ Patch coverage is 81.81818% with 8 lines in your changes missing coverage. Please review.
✅ Project coverage is 93.83%. Comparing base (21a85a5) to head (8cc2510).
✅ All tests successful. No failed tests found.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| apps/worker/tasks/base.py | 66.66% | 8 Missing ⚠️ |
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #557      +/-   ##
==========================================
- Coverage   93.84%   93.83%   -0.02%     
==========================================
  Files        1284     1284              
  Lines       46325    46364      +39     
  Branches     1524     1524              
==========================================
+ Hits        43476    43507      +31     
- Misses       2540     2548       +8     
  Partials      309      309              
| Flag | Coverage Δ |
| --- | --- |
| apiunit | 96.54% <ø> (ø) |
| sharedintegration | 38.73% <100.00%> (+0.03%) ⬆️ |
| sharedunit | 88.78% <100.00%> (+<0.01%) ⬆️ |
| workerintegration | 58.67% <22.85%> (-0.09%) ⬇️ |
| workerunit | 91.11% <77.14%> (-0.04%) ⬇️ |

Flags with carried forward coverage won't be shown.

☔ View full report in Codecov by Sentry.


codecov-notifications bot commented Nov 11, 2025

Codecov Report

❌ Patch coverage is 81.81818% with 8 lines in your changes missing coverage. Please review.
✅ All tests successful. No failed tests found.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| apps/worker/tasks/base.py | 66.66% | 8 Missing ⚠️ |


**Issue:** AI agent identified that uploads hitting 10-retry limit would be
silently dropped without recovery mechanism, potentially causing data loss.

**Root Cause:** When task gave up after 10 retries, uploads remained in Redis
but were never processed. They would either:
- Wait for another upload task (unlikely if processing lock stuck)
- Expire after 24 hours (data loss)

**Solution:** Implemented Dead Letter Queue (DLQ)

1. **Atomic move to DLQ**: When hitting retry limit, lpop all pending uploads
   and push to DLQ key: upload_dlq/{repoid}/{commitid}/{report_type}

2. **7-day retention**: DLQ entries expire after 7 days, giving team time to
   inspect and recover

3. **Monitoring**: Added UPLOAD_TASK_DLQ_COUNTER metric to track entries

4. **Logging**: Error logs include DLQ key and upload count for debugging

**Benefits:**
- Prevents silent data loss
- Enables manual recovery/inspection
- Visibility via metrics and alerts
- Future: Can add automated recovery script

**Testing:**
- No linter errors
- Backward compatible (DLQ is additive)
- Metric increments properly
- Redis TTL prevents DLQ buildup

This addresses the valid concern raised by the AI agent and ensures no uploads
are lost, even in worst-case scenarios.
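
A minimal sketch of the DLQ move described above, assuming illustrative names for the pending-upload key and logger; only the DLQ key format and the 7-day TTL come from this description:

import logging

log = logging.getLogger(__name__)

DLQ_TTL_SECONDS = 7 * 24 * 60 * 60  # 7-day retention window


def move_pending_uploads_to_dlq(redis, repoid, commitid, report_type, uploads_key):
    dlq_key = f"upload_dlq/{repoid}/{commitid}/{report_type}"
    moved = 0
    while True:
        item = redis.lpop(uploads_key)  # atomic pop, no duplicate consumption
        if item is None:
            break
        redis.rpush(dlq_key, item)
        moved += 1
    if moved:
        redis.expire(dlq_key, DLQ_TTL_SECONDS)
        # UPLOAD_TASK_DLQ_COUNTER.labels(report_type=report_type).inc(moved)
        log.error(
            "Max retries exceeded, uploads moved to DLQ",
            extra={"dlq_key": dlq_key, "upload_count": moved},
        )
    return moved
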
PROBLEM:
Upload tasks could be lost for extended periods when worker pods died
during retry processing due to:
- retries==0 check only ran on the first attempt, so it was skipped when a retry followed a pod failure
- 6-hour visibility timeout delayed recovery
- No automatic task recovery when workers died

SOLUTION:
1. Fixed Retry Logic
   - Removed retries==0 condition - checks processing on every retry
   - Added max retry limit (10) to prevent infinite loops
   - Enhanced logging with retry count tracking

2. Faster Pod Failure Recovery
   - Reduced visibility_timeout: 6 hours → 15 minutes (configurable)
   - Enabled task_reject_on_worker_lost for immediate re-queuing

3. Configuration Centralization
   - All resilience values in shared/celery_config.py
   - YAML/env override support with inline documentation
   - Links to official Celery/Redis docs in code comments

4. Enhanced Observability
   - worker_task_counts_retries_by_count - retry patterns
   - upload_task_processing_retry - processing lock retries
   - upload_task_too_many_retries - tasks hitting max retries
   - worker_task_counts_max_retries_exceeded - widespread issues

5. Generalized for All Tasks
   - Enhanced BaseCodecovTask.safe_retry() with metrics
   - Configurable max retries and exponential backoff
   - Available to all tasks for consistent retry behavior

IMPACT:
- Pod failures now recover in 15 minutes instead of 6 hours
- Retry logic works correctly after pod deaths
- All configuration centralized and documented
- New metrics enable alerting on stuck/failing uploads

NEXT STEPS (separate PRs):
- Dead Letter Queue for tasks exceeding max retries
- Prometheus alerting on new metrics

SAFETY:
- Backward compatible (no breaking changes)
- Race conditions prevented by existing Redis locks
- Rollback plan documented in PR description
…eout

1. Update test_run_impl_currently_processing_second_retry:
   - Test now verifies processing check runs on ANY retry (not just first)
   - Task should raise Retry exception when processing is ongoing
   - This properly tests the fix for the retries==0 bug

2. Update test_celery_config:
   - Update expected visibility_timeout: 21600 (6h) → 900 (15m)
   - Add comment explaining the change for pod failure recovery

Both changes align with the resilience improvements in the main PR.
Add test_run_impl_too_many_retries_logs_checkpoint to verify that
UploadFlow.TOO_MANY_RETRIES checkpoint is logged when a task hits
the max retry limit (10 retries) while processing is ongoing.

This addresses missing test coverage identified by code review for
the new retry limit behavior added in the resilience improvements.

The test verifies:
- maybe_log_upload_checkpoint is called with TOO_MANY_RETRIES
- Task returns with reason 'too_many_processing_retries'
- Task doesn't proceed with setup when max retries exceeded
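
A hedged sketch of how such a test might be shaped; fixture names, patch targets, and import paths are assumptions rather than the actual test code:

from tasks.upload import UploadTask, UploadFlow  # import paths assumed


def test_run_impl_too_many_retries_logs_checkpoint(mocker, dbsession, mock_redis):
    # Method name assumed: whatever check detects the processing lock being held
    mocker.patch.object(UploadTask, "is_currently_processing", return_value=True)
    checkpoint = mocker.patch("tasks.upload.maybe_log_upload_checkpoint")
    # Simulate a task that has already hit the 10-retry limit
    mocker.patch.object(UploadTask, "request", retries=10)

    result = UploadTask().run_impl(dbsession, repoid=1, commitid="abc123")

    checkpoint.assert_called_with(UploadFlow.TOO_MANY_RETRIES)
    assert result["reason"] == "too_many_processing_retries"
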
# Can be overridden via: setup.tasks.celery.reject_on_worker_lost config
# Celery docs: https://docs.celeryq.dev/en/stable/userguide/configuration.html#task-reject-on-worker-lost
task_reject_on_worker_lost = bool(
    get_config("setup", "tasks", "celery", "reject_on_worker_lost", default=True)
)
Contributor commented:

The docs do mention "Enabling this can cause message loops", so maybe it's safer to have the default be False and activate it in specific tasks?

(like the Upload task that won't run in a loop forever because the arguments in redis eventually expire)

Contributor commented:

Ah now I see the IMPORTANT note above... not all tasks have acks_late set to True, do they?

Contributor (Author) commented:

Make it clearer that there are two recovery mechanisms:
1. task_reject_on_worker_lost - immediate retry when pod dies
2. visibility_timeout (15 min) - safety net for lost/unacknowledged tasks

Previous wording was ambiguous about when 15 minutes applied.
The file contained detailed documentation on upload task resilience improvements, which have been fully integrated into the codebase and commit history. This cleanup helps streamline project documentation.