CCMRG-1863 Fix upload task resilience to pod failures #557
base: main
Conversation
**Problem:** Upload task 15fa00a1-b967-4ab2-9199-543d7e41e89d experienced a 6-hour delay when a pod failure occurred. The retry was received by a worker pod that died before processing, causing the task to be lost until the visibility timeout expired.

**Root Causes:**
1. The `retries == 0` condition only checked for processing on the first attempt
2. The 6-hour visibility timeout meant lost tasks took 6 hours to recover
3. Tasks weren't automatically returned to the queue when workers died
4. No metrics to monitor stuck or repeatedly retried uploads

**Changes:**
1. **Removed `retries == 0` condition** (apps/worker/tasks/upload.py)
   - Now EVERY retry checks if the upload is currently processing
   - Added max retry limit of 10 to prevent infinite loops
   - Logs include retry count for debugging
2. **Reduced visibility timeout** (libs/shared/shared/celery_config.py)
   - Changed from 6 hours to 15 minutes
   - Lost tasks are now retried in 15 minutes instead of 6 hours
3. **Added task_reject_on_worker_lost** (libs/shared/shared/celery_config.py)
   - When pods die, tasks are immediately returned to the queue
   - Another worker picks up the task within seconds
4. **Added monitoring metrics** (apps/worker/tasks/upload.py)
   - UPLOAD_TASK_PROCESSING_RETRY_COUNTER: tracks retry patterns
   - UPLOAD_TASK_TOO_MANY_RETRIES_COUNTER: tracks abandoned uploads

**Protection Against Double Processing:** Multiple layers prevent race conditions:
- upload_processing_lock check detects if a processor is running
- upload_lock (Redis lock) serializes Upload task execution
- has_pending_jobs() recheck inside the lock verifies jobs still exist
- Atomic lpop ensures the same job isn't processed twice

**Impact:**
- Pod failures now recover in seconds/minutes instead of 6 hours
- Tasks survive worker crashes without data loss
- Better observability via metrics
- Fully backward compatible

See UPLOAD_TASK_RESILIENCE_CHANGES.md for the complete analysis and testing plan.
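To make change 1 concrete, here is a minimal sketch of the intended control flow: the processing check runs on every attempt and gives up after 10 retries. This is not the repository's actual code; the lock key format, the 60-second countdown, and the helper name `check_processing_before_setup` are assumptions, only the behaviour and the metric names come from the description above.

```python
from prometheus_client import Counter

MAX_PROCESSING_RETRIES = 10  # hard cap described in change 1

UPLOAD_TASK_PROCESSING_RETRY_COUNTER = Counter(
    "upload_task_processing_retry",
    "Upload task retried because report processing was still in progress",
    ["report_type", "retry_count"],
)
UPLOAD_TASK_TOO_MANY_RETRIES_COUNTER = Counter(
    "upload_task_too_many_retries",
    "Upload tasks abandoned after hitting the processing retry limit",
    ["report_type"],
)


def check_processing_before_setup(task, redis, repoid, commitid, report_type):
    """Runs on EVERY attempt; the old code skipped it unless task.request.retries == 0."""
    lock_key = f"upload_processing_lock_{repoid}_{commitid}"  # assumed key format
    if not redis.exists(lock_key):
        return True  # no processor running, safe to continue with setup

    retries = task.request.retries
    if retries >= MAX_PROCESSING_RETRIES:
        # Give up instead of retrying forever; see the DLQ follow-up later in this thread.
        UPLOAD_TASK_TOO_MANY_RETRIES_COUNTER.labels(report_type=report_type).inc()
        return False

    UPLOAD_TASK_PROCESSING_RETRY_COUNTER.labels(
        report_type=report_type, retry_count=str(retries)
    ).inc()
    # Raises celery.exceptions.Retry; the countdown value is an assumption.
    task.retry(countdown=60, max_retries=MAX_PROCESSING_RETRIES)
```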
Change 3 should also address CCMRG-1848.

@sentry review
CodSpeed Performance Report: Merging #557 will not alter performance.
Codecov Report: ❌ Patch coverage is
Additional details and impacted files:
@@ Coverage Diff @@
## main #557 +/- ##
==========================================
- Coverage 93.84% 93.83% -0.02%
==========================================
Files 1284 1284
Lines 46325 46364 +39
Branches 1524 1524
==========================================
+ Hits 43476 43507 +31
- Misses 2540 2548 +8
Partials 309 309
Flags with carried forward coverage won't be shown. ☔ View full report in Codecov by Sentry.
**Issue:** An AI agent identified that uploads hitting the 10-retry limit would be
silently dropped without a recovery mechanism, potentially causing data loss.
**Root Cause:** When the task gave up after 10 retries, uploads remained in Redis
but were never processed. They would either:
- Wait for another upload task (unlikely if the processing lock was stuck)
- Expire after 24 hours (data loss)
**Solution:** Implemented Dead Letter Queue (DLQ)
1. **Atomic move to DLQ**: When hitting retry limit, lpop all pending uploads
and push to DLQ key: upload_dlq/{repoid}/{commitid}/{report_type}
2. **7-day retention**: DLQ entries expire after 7 days, giving team time to
inspect and recover
3. **Monitoring**: Added UPLOAD_TASK_DLQ_COUNTER metric to track entries
4. **Logging**: Error logs include DLQ key and upload count for debugging
**Benefits:**
- Prevents silent data loss
- Enables manual recovery/inspection
- Visibility via metrics and alerts
- Future: Can add automated recovery script
**Testing:**
- No linter errors
- Backward compatible (DLQ is additive)
- Metric increments properly
- Redis TTL prevents DLQ buildup
This addresses the valid concern raised by the AI agent and ensures no uploads
are lost, even in worst-case scenarios.
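The key layout and retention described above translate roughly into the following sketch, assuming a redis-py client; the pending-uploads key format, the helper name, and the Prometheus metric name/labels are assumptions, while the DLQ key pattern, the 7-day TTL, and the UPLOAD_TASK_DLQ_COUNTER metric come from the commit message.

```python
import logging

from prometheus_client import Counter

log = logging.getLogger(__name__)

DLQ_TTL_SECONDS = 7 * 24 * 60 * 60  # 7-day retention before Redis expires the DLQ key

UPLOAD_TASK_DLQ_COUNTER = Counter(
    "upload_task_dlq",
    "Uploads moved to the dead letter queue after exhausting retries",
    ["report_type"],
)


def move_pending_uploads_to_dlq(redis, repoid, commitid, report_type):
    pending_key = f"uploads/{repoid}/{commitid}/{report_type}"  # assumed key format
    dlq_key = f"upload_dlq/{repoid}/{commitid}/{report_type}"
    moved = 0
    while True:
        raw = redis.lpop(pending_key)  # each lpop is atomic, so no job is taken twice
        if raw is None:
            break
        redis.rpush(dlq_key, raw)
        moved += 1
    if moved:
        redis.expire(dlq_key, DLQ_TTL_SECONDS)
        UPLOAD_TASK_DLQ_COUNTER.labels(report_type=report_type).inc(moved)
        log.error(
            "Upload task exhausted retries; pending uploads moved to DLQ",
            extra={"dlq_key": dlq_key, "upload_count": moved},
        )
    return moved
```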
PROBLEM:
Upload tasks could be lost for extended periods when worker pods died during retry processing due to:
- retries==0 check only ran on first attempt, skipping on pod failures
- 6-hour visibility timeout delayed recovery
- No automatic task recovery when workers died

SOLUTION:
1. Fixed Retry Logic
   - Removed retries==0 condition - checks processing on every retry
   - Added max retry limit (10) to prevent infinite loops
   - Enhanced logging with retry count tracking
2. Faster Pod Failure Recovery
   - Reduced visibility_timeout: 6 hours → 15 minutes (configurable)
   - Enabled task_reject_on_worker_lost for immediate re-queuing
3. Configuration Centralization
   - All resilience values in shared/celery_config.py
   - YAML/env override support with inline documentation
   - Links to official Celery/Redis docs in code comments
4. Enhanced Observability
   - worker_task_counts_retries_by_count - retry patterns
   - upload_task_processing_retry - processing lock retries
   - upload_task_too_many_retries - tasks hitting max retries
   - worker_task_counts_max_retries_exceeded - widespread issues
5. Generalized for All Tasks
   - Enhanced BaseCodecovTask.safe_retry() with metrics
   - Configurable max retries and exponential backoff
   - Available to all tasks for consistent retry behavior

IMPACT:
- Pod failures now recover in 15 minutes instead of 6 hours
- Retry logic works correctly after pod deaths
- All configuration centralized and documented
- New metrics enable alerting on stuck/failing uploads

NEXT STEPS (separate PRs):
- Dead Letter Queue for tasks exceeding max retries
- Prometheus alerting on new metrics

SAFETY:
- Backward compatible (no breaking changes)
- Race conditions prevented by existing Redis locks
- Rollback plan documented in PR description
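A rough sketch of how the two recovery settings could sit together in `libs/shared/shared/celery_config.py`, following the `get_config` pattern quoted in the review thread below; the `visibility_timeout` config path, the import path, and the exact variable layout are assumptions, not the repository's actual code.

```python
from shared.config import get_config  # assumed import path for the config helper

# Safety net: Redis redelivers unacknowledged tasks after this many seconds.
# Previously 6 hours (21600); now 15 minutes, overridable via YAML/env.
visibility_timeout = int(
    get_config("setup", "tasks", "celery", "visibility_timeout", default=900)
)
broker_transport_options = {"visibility_timeout": visibility_timeout}

# Fast path: if the worker process dies mid-task, the broker re-queues the
# message immediately so another pod picks it up within seconds.
# Requires late acknowledgement, otherwise the message was already acked.
task_acks_late = True
task_reject_on_worker_lost = bool(
    get_config("setup", "tasks", "celery", "reject_on_worker_lost", default=True)
)
```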
…eout

1. Update test_run_impl_currently_processing_second_retry:
   - Test now verifies the processing check runs on ANY retry (not just the first)
   - Task should raise Retry exception when processing is ongoing
   - This properly tests the fix for the retries==0 bug
2. Update test_celery_config:
   - Update expected visibility_timeout: 21600 (6h) → 900 (15m)
   - Add comment explaining the change for pod failure recovery

Both changes align with the resilience improvements in the main PR.
Add test_run_impl_too_many_retries_logs_checkpoint to verify that the UploadFlow.TOO_MANY_RETRIES checkpoint is logged when a task hits the max retry limit (10 retries) while processing is ongoing. This addresses missing test coverage identified by code review for the new retry limit behavior added in the resilience improvements.

The test verifies:
- maybe_log_upload_checkpoint is called with TOO_MANY_RETRIES
- Task returns with reason 'too_many_processing_retries'
- Task doesn't proceed with setup when max retries exceeded
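A self-contained illustration of the assertion pattern that test describes; everything here (the task double, the checkpoint logger, the helper, the `was_setup` key) is a stand-in rather than the repository's real fixtures, only the checkpoint name, the reason string, and the 10-retry cap come from the commit message.

```python
from unittest.mock import MagicMock

MAX_PROCESSING_RETRIES = 10


def log_checkpoint_and_bail(task, log_checkpoint, upload_flow):
    # Stand-in for the relevant slice of run_impl: at the retry cap, log the
    # TOO_MANY_RETRIES checkpoint and return without doing any setup work.
    if task.request.retries >= MAX_PROCESSING_RETRIES:
        log_checkpoint(upload_flow.TOO_MANY_RETRIES)
        return {"was_setup": False, "reason": "too_many_processing_retries"}
    return None


def test_run_impl_too_many_retries_logs_checkpoint():
    task = MagicMock()
    task.request.retries = 10
    log_checkpoint = MagicMock()  # stand-in for maybe_log_upload_checkpoint
    upload_flow = MagicMock()     # stand-in for UploadFlow

    result = log_checkpoint_and_bail(task, log_checkpoint, upload_flow)

    log_checkpoint.assert_called_once_with(upload_flow.TOO_MANY_RETRIES)
    assert result["reason"] == "too_many_processing_retries"
    assert result["was_setup"] is False
```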
# Can be overridden via: setup.tasks.celery.reject_on_worker_lost config
# Celery docs: https://docs.celeryq.dev/en/stable/userguide/configuration.html#task-reject-on-worker-lost
task_reject_on_worker_lost = bool(
    get_config("setup", "tasks", "celery", "reject_on_worker_lost", default=True)
)
The docs do mention "Enabling this can cause message loops", so maybe it's safer to have the default be False and activate it in specific tasks?
(like the Upload task that won't run in a loop forever because the arguments in redis eventually expire)
Ah now I see the IMPORTANT note above... not all tasks have acks_late set to True, do they?
I don't think so. config is here https://github.com/codecov/k8s-v2/blob/16c9272495aa8f2fc6bbc5fefeee64ed8597bace/config/production.yaml#L524
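For reference, the per-task alternative suggested above could look roughly like this; `upload_task_stub` and the module layout are invented for illustration, but `acks_late` and `reject_on_worker_lost` are standard Celery task options.

```python
from celery import Celery

app = Celery("worker")

# Keep the global defaults off to avoid the message-loop risk mentioned in the docs.
app.conf.task_acks_late = False
app.conf.task_reject_on_worker_lost = False


# Opt in only where redelivery is known to be safe: the Upload task's arguments
# live in Redis and eventually expire, so it cannot loop forever.
@app.task(bind=True, acks_late=True, reject_on_worker_lost=True)
def upload_task_stub(self, repoid, commitid):
    pass  # placeholder body; the real task lives in apps/worker/tasks/upload.py
```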
Make it clearer that there are two recovery mechanisms:
1. task_reject_on_worker_lost - immediate retry when a pod dies
2. visibility_timeout (15 min) - safety net for lost/unacknowledged tasks

Previous wording was ambiguous about when the 15 minutes applied.
The file contained detailed documentation on upload task resilience improvements, which have been fully integrated into the codebase and commit history. This cleanup helps streamline project documentation.
Upload Task Resilience Improvements
Problem
Upload tasks could be lost for extended periods when worker pods died during retry processing.
Root causes:
- `retries == 0` check only verified processing status on first attempt

Solution
1. Fixed Retry Logic (`apps/worker/tasks/upload.py`)

- Removed `retries == 0` condition - processing status checked on every retry

Impact: Pod failures during retry no longer cause tasks to skip processing checks.
2. Faster Pod Failure Recovery (`libs/shared/shared/celery_config.py`)

- `task_reject_on_worker_lost` - tasks immediately re-queued on pod death

Impact: Tasks lost to pod crashes retry in 15 minutes instead of 6 hours.
3. Configuration Centralization (`libs/shared/shared/celery_config.py`)

All resilience values are now configurable via YAML or environment variables. See inline code comments for full configuration options and docs links.
4. Enhanced Observability
New metrics for monitoring:
- `worker_task_counts_retries_by_count{task, retry_count}` - Track retry patterns
- `upload_task_processing_retry{report_type, retry_count}` - Processing lock retries
- `upload_task_too_many_retries{report_type}` - Tasks hitting max retries

5. Generalized for All Tasks (`apps/worker/tasks/base.py`)

Enhanced `BaseCodecovTask.safe_retry()`: all tasks can now use improved retry logic, as sketched below.
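A sketch of what the generalized helper could look like, based only on the bullets above (configurable max retries, exponential backoff, metrics); the actual `BaseCodecovTask.safe_retry()` signature, metric label sets, and backoff constants in `apps/worker/tasks/base.py` may differ.

```python
import logging

from celery import Task
from celery.exceptions import MaxRetriesExceededError
from prometheus_client import Counter

log = logging.getLogger(__name__)

TASK_RETRIES_BY_COUNT = Counter(
    "worker_task_counts_retries_by_count",
    "Task retries, labelled by task name and retry number",
    ["task", "retry_count"],
)
TASK_MAX_RETRIES_EXCEEDED = Counter(
    "worker_task_counts_max_retries_exceeded",
    "Tasks that gave up after exhausting their retry budget",
    ["task"],
)


class BaseCodecovTaskSketch(Task):
    def safe_retry(self, max_retries=10, base_countdown=20, **retry_kwargs):
        """Retry with exponential backoff and metrics.

        Raises celery.exceptions.Retry when a retry is scheduled; returns False
        (instead of raising MaxRetriesExceededError) once the budget is spent.
        """
        retries = self.request.retries
        if retries >= max_retries:
            TASK_MAX_RETRIES_EXCEEDED.labels(task=self.name).inc()
            log.warning("Task exceeded max retries", extra={"task": self.name})
            return False
        TASK_RETRIES_BY_COUNT.labels(task=self.name, retry_count=str(retries)).inc()
        countdown = base_countdown * (2**retries)  # exponential backoff
        try:
            self.retry(countdown=countdown, max_retries=max_retries, **retry_kwargs)
        except MaxRetriesExceededError:
            TASK_MAX_RETRIES_EXCEEDED.labels(task=self.name).inc()
            return False
```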
Safety
Race condition protection:
- Atomic `lpop` operations prevent duplicate job consumption

Rollback:
If issues arise, revert changes to:
- `libs/shared/shared/celery_config.py` (restores 6-hour visibility timeout)
- `apps/worker/tasks/upload.py` (restores `retries == 0` check)

Files Modified
Core changes:
- `libs/shared/shared/celery_config.py` - Configuration centralization with inline docs
- `apps/worker/tasks/base.py` - Enhanced BaseCodecovTask with safe_retry()
- `apps/worker/tasks/upload.py` - Fixed retry logic (removed `retries == 0` check)

Next Steps
Future improvements (separate PRs):
- Alert when `upload_task_processing_retry{retry_count="5"}` > threshold (stuck uploads)
- Alert when `upload_task_too_many_retries` increases (tasks failing repeatedly)
- Alert when `worker_task_counts_max_retries_exceeded` increases (widespread issues)

Backward Compatible: ✅ Yes
Database Changes: ❌ None
Breaking Changes: ❌ None
Configuration Required: Ensure `task_acks_late: true` is set in deployment config