Conversation

@drazisil-codecov (Contributor) commented Nov 11, 2025

Upload Task Resilience Improvements

Problem

Upload tasks could be lost for extended periods when worker pods died during retry processing.

Root causes:

  • retries == 0 check only verified processing status on first attempt
  • 6-hour visibility timeout delayed recovery from pod failures
  • No automatic task recovery when workers died
  • No metrics for stuck or retrying uploads

Solution

1. Fixed Retry Logic (apps/worker/tasks/upload.py)

  • Removed retries == 0 condition - processing status checked on every retry
  • Added max retry limit (10) to prevent infinite loops
  • Added retry count metrics for observability

Impact: Pod failures during retry no longer cause tasks to skip processing checks.
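
For illustration, here is a minimal sketch of the new check, assuming hypothetical helper and constant names (has_upload_processing_lock, MAX_PROCESSING_RETRIES); it is not a copy of the actual diff:

import logging

log = logging.getLogger(__name__)

MAX_PROCESSING_RETRIES = 10  # documented default in this PR


def has_upload_processing_lock(redis, repoid, commitid):
    # Illustrative only: the key format is an assumption, not the real one
    return bool(redis.exists(f"upload_processing_lock/{repoid}/{commitid}"))


def check_still_processing(task, redis, repoid, commitid):
    # Runs on EVERY attempt now, not only when task.request.retries == 0
    if has_upload_processing_lock(redis, repoid, commitid):
        if task.request.retries >= MAX_PROCESSING_RETRIES:
            log.warning(
                "Upload still processing after max retries, giving up",
                extra={"repoid": repoid, "commitid": commitid,
                       "retry_count": task.request.retries},
            )
            return {"reason": "too_many_processing_retries"}
        # Celery's retry() raises a Retry exception, re-scheduling the task
        task.retry(countdown=60, max_retries=MAX_PROCESSING_RETRIES)
    return None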

2. Faster Pod Failure Recovery (libs/shared/shared/celery_config.py)

  • Reduced visibility timeout: 6 hours → 15 minutes (configurable)
  • Enabled task_reject_on_worker_lost - tasks immediately re-queued on pod death

Impact: Tasks lost to pod crashes retry in 15 minutes instead of 6 hours.
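
In Celery terms, these two settings look roughly like the sketch below (values taken from this description, not a verbatim copy of the diff):

# Redis transport option: how long an unacknowledged task stays invisible
# before Redis re-delivers it to another worker.
broker_transport_options = {
    "visibility_timeout": 900,  # 15 minutes; previously 21600 (6 hours)
}

# task_reject_on_worker_lost only helps when acks are late, i.e. tasks are
# acknowledged after completion rather than on receipt.
task_acks_late = True
task_reject_on_worker_lost = True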

3. Configuration Centralization (libs/shared/shared/celery_config.py)

All resilience values now configurable via YAML or environment variables:

setup:
  tasks:
    celery:
      visibility_timeout: 900        # 15 min (default)
      max_retries: 10
      reject_on_worker_lost: true
      retry_backoff_base: 20         # Exponential backoff base
    upload:
      processing_max_retries: 10
      processing_retry_delay: 60

See inline code comments for full configuration options and docs links.
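
As a hedged example, task code could read the upload-specific values above via the shared config helper (the exact call sites are not shown in this description; import path assumed):

from shared.config import get_config

# Keys mirror the YAML above; defaults match the documented values.
processing_max_retries = get_config(
    "setup", "tasks", "upload", "processing_max_retries", default=10
)
processing_retry_delay = get_config(
    "setup", "tasks", "upload", "processing_retry_delay", default=60
)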

4. Enhanced Observability

New metrics for monitoring:

  • worker_task_counts_retries_by_count{task, retry_count} - Track retry patterns
  • upload_task_processing_retry{report_type, retry_count} - Processing lock retries
  • upload_task_too_many_retries{report_type} - Tasks hitting max retries
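
The counters behind these names could be declared roughly as follows with prometheus_client (label sets inferred from the metric names above; the actual declarations may differ):

from prometheus_client import Counter

UPLOAD_TASK_PROCESSING_RETRY_COUNTER = Counter(
    "upload_task_processing_retry",
    "Upload task retried because processing was still in progress",
    ["report_type", "retry_count"],
)
UPLOAD_TASK_TOO_MANY_RETRIES_COUNTER = Counter(
    "upload_task_too_many_retries",
    "Upload tasks that gave up after hitting the max retry limit",
    ["report_type"],
)

# Example increment when the processing lock is still held on a retry:
UPLOAD_TASK_PROCESSING_RETRY_COUNTER.labels(
    report_type="coverage", retry_count="3"
).inc()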

5. Generalized for All Tasks (apps/worker/tasks/base.py)

Enhanced BaseCodecovTask.safe_retry():

  • Configurable max retries with graceful failure
  • Exponential backoff using centralized config
  • Comprehensive metrics tracking
  • Proper error handling

All tasks can now use improved retry logic:

if not self.safe_retry(max_retries=5, countdown=60):
    # Max retries exceeded - handle gracefully
    return {"success": False, "reason": "max_retries"}

Safety

Race condition protection:

  • Existing Redis locks prevent concurrent processing
  • Atomic lpop operations prevent duplicate job consumption
  • Processing lock expires after ~12 minutes
  • Upload queues expire after 24 hours
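
For context, the atomic-lpop drain works roughly like the sketch below (the key format and function name are illustrative, not the real implementation):

import json


def drain_pending_uploads(redis, repoid, commitid):
    uploads_key = f"uploads/{repoid}/{commitid}"  # illustrative key format
    jobs = []
    while True:
        raw = redis.lpop(uploads_key)  # atomic: each job is popped by exactly one worker
        if raw is None:
            break
        jobs.append(json.loads(raw))
    return jobs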

Rollback:
If issues arise, revert changes to:

  • libs/shared/shared/celery_config.py (restores 6-hour visibility timeout)
  • apps/worker/tasks/upload.py (restores retries == 0 check)

Files Modified

Core changes:

  • libs/shared/shared/celery_config.py - Configuration centralization with inline docs
  • apps/worker/tasks/base.py - Enhanced BaseCodecovTask with safe_retry()
  • apps/worker/tasks/upload.py - Fixed retry logic (removed retries==0 check)

Next Steps

Future improvements (separate PRs):

  1. Dead Letter Queue (DLQ) - Add recovery mechanism for tasks that hit max retries, preventing data loss
  2. Alerting - Set up Prometheus alerts on new metrics:
    • Alert when upload_task_processing_retry{retry_count="5"} > threshold (stuck uploads)
    • Alert when upload_task_too_many_retries increases (tasks failing repeatedly)
    • Alert when worker_task_counts_max_retries_exceeded increases (widespread issues)

Backward Compatible: ✅ Yes
Database Changes: ❌ None
Breaking Changes: ❌ None
Configuration Required: Ensure task_acks_late: true is set in deployment config

**Problem:**
Upload task 15fa00a1-b967-4ab2-9199-543d7e41e89d experienced a 6-hour delay
after a pod failure: the retry was delivered to a worker pod that died before
processing it, so the task was lost until the visibility timeout expired.

**Root Causes:**
1. The `retries == 0` condition only checked for processing on first attempt
2. 6-hour visibility timeout meant lost tasks took 6 hours to recover
3. Tasks weren't automatically returned to queue when workers died
4. No metrics to monitor stuck or repeatedly retried uploads

**Changes:**

1. **Removed `retries == 0` condition** (apps/worker/tasks/upload.py)
   - Now EVERY retry checks if upload is currently processing
   - Added max retry limit of 10 to prevent infinite loops
   - Logs include retry count for debugging

2. **Reduced visibility timeout** (libs/shared/shared/celery_config.py)
   - Changed from 6 hours to 15 minutes
   - Lost tasks are now retried in 15 minutes instead of 6 hours

3. **Added task_reject_on_worker_lost** (libs/shared/shared/celery_config.py)
   - When pods die, tasks are immediately returned to queue
   - Another worker picks up the task within seconds

4. **Added monitoring metrics** (apps/worker/tasks/upload.py)
   - UPLOAD_TASK_PROCESSING_RETRY_COUNTER: tracks retry patterns
   - UPLOAD_TASK_TOO_MANY_RETRIES_COUNTER: tracks abandoned uploads

**Protection Against Double Processing:**
Multiple layers prevent race conditions:
- upload_processing_lock check detects if processor is running
- upload_lock (Redis lock) serializes Upload task execution
- has_pending_jobs() recheck inside lock verifies jobs still exist
- Atomic lpop ensures same job isn't processed twice

**Impact:**
- Pod failures now recover in seconds/minutes instead of 6 hours
- Tasks survive worker crashes without data loss
- Better observability via metrics
- Fully backward compatible

See UPLOAD_TASK_RESILIENCE_CHANGES.md for complete analysis and testing plan.
@drazisil-codecov changed the title from "Fix upload task resilience to pod failures" to "CCMRG-1863 Fix upload task resilience to pod failures" on Nov 11, 2025

linear bot commented Nov 11, 2025

@drazisil-codecov (Contributor, Author) commented:

Change 3 should also address CCMRG-1848

@drazisil-codecov (Contributor, Author) commented:

@sentry review


codspeed-hq bot commented Nov 11, 2025

CodSpeed Performance Report

Merging #557 will not alter performance

Comparing fix-upload-task-pod-failure-resilience (8cc2510) with main (5ba2af3)¹

Summary

✅ 9 untouched

Footnotes

¹ No successful run was found on main (21a85a5) during the generation of this report, so 5ba2af3 was used instead as the comparison base. There might be some changes unrelated to this pull request in this report.


sentry bot commented Nov 11, 2025

Codecov Report

❌ Patch coverage is 81.81818% with 8 lines in your changes missing coverage. Please review.
✅ Project coverage is 93.83%. Comparing base (21a85a5) to head (8cc2510).
✅ All tests successful. No failed tests found.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| apps/worker/tasks/base.py | 66.66% | 8 Missing ⚠️ |
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #557      +/-   ##
==========================================
- Coverage   93.84%   93.83%   -0.02%     
==========================================
  Files        1284     1284              
  Lines       46325    46364      +39     
  Branches     1524     1524              
==========================================
+ Hits        43476    43507      +31     
- Misses       2540     2548       +8     
  Partials      309      309              
| Flag | Coverage Δ |
| --- | --- |
| apiunit | 96.54% <ø> (ø) |
| sharedintegration | 38.73% <100.00%> (+0.03%) ⬆️ |
| sharedunit | 88.78% <100.00%> (+<0.01%) ⬆️ |
| workerintegration | 58.67% <22.85%> (-0.09%) ⬇️ |
| workerunit | 91.11% <77.14%> (-0.04%) ⬇️ |

Flags with carried forward coverage won't be shown.

☔ View full report in Codecov by Sentry.


codecov-notifications bot commented Nov 11, 2025

Codecov Report

❌ Patch coverage is 81.81818% with 8 lines in your changes missing coverage. Please review.
✅ All tests successful. No failed tests found.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| apps/worker/tasks/base.py | 66.66% | 8 Missing ⚠️ |


**Issue:** AI agent identified that uploads hitting 10-retry limit would be
silently dropped without recovery mechanism, potentially causing data loss.

**Root Cause:** When task gave up after 10 retries, uploads remained in Redis
but were never processed. They would either:
- Wait for another upload task (unlikely if processing lock stuck)
- Expire after 24 hours (data loss)

**Solution:** Implemented Dead Letter Queue (DLQ)

1. **Atomic move to DLQ**: When hitting retry limit, lpop all pending uploads
   and push to DLQ key: upload_dlq/{repoid}/{commitid}/{report_type}

2. **7-day retention**: DLQ entries expire after 7 days, giving team time to
   inspect and recover

3. **Monitoring**: Added UPLOAD_TASK_DLQ_COUNTER metric to track entries

4. **Logging**: Error logs include DLQ key and upload count for debugging

**Benefits:**
- Prevents silent data loss
- Enables manual recovery/inspection
- Visibility via metrics and alerts
- Future: Can add automated recovery script

**Testing:**
- No linter errors
- Backward compatible (DLQ is additive)
- Metric increments properly
- Redis TTL prevents DLQ buildup

This addresses the valid concern raised by the AI agent and ensures no uploads
are lost, even in worst-case scenarios.
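
A minimal sketch of the DLQ move described above, assuming illustrative names for the pending-upload key and logger; only the DLQ key format and the 7-day TTL come from this description:

import logging

log = logging.getLogger(__name__)

DLQ_TTL_SECONDS = 7 * 24 * 60 * 60  # 7-day retention window


def move_pending_uploads_to_dlq(redis, repoid, commitid, report_type, uploads_key):
    dlq_key = f"upload_dlq/{repoid}/{commitid}/{report_type}"
    moved = 0
    while True:
        item = redis.lpop(uploads_key)  # atomic pop, no duplicate consumption
        if item is None:
            break
        redis.rpush(dlq_key, item)
        moved += 1
    if moved:
        redis.expire(dlq_key, DLQ_TTL_SECONDS)
        # UPLOAD_TASK_DLQ_COUNTER.labels(report_type=report_type).inc(moved)
        log.error(
            "Max retries exceeded, uploads moved to DLQ",
            extra={"dlq_key": dlq_key, "upload_count": moved},
        )
    return moved
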
PROBLEM:
Upload tasks could be lost for extended periods when worker pods died
during retry processing due to:
- retries==0 check only ran on the first attempt, so it was skipped when a retry followed a pod failure
- 6-hour visibility timeout delayed recovery
- No automatic task recovery when workers died

SOLUTION:
1. Fixed Retry Logic
   - Removed retries==0 condition - checks processing on every retry
   - Added max retry limit (10) to prevent infinite loops
   - Enhanced logging with retry count tracking

2. Faster Pod Failure Recovery
   - Reduced visibility_timeout: 6 hours → 15 minutes (configurable)
   - Enabled task_reject_on_worker_lost for immediate re-queuing

3. Configuration Centralization
   - All resilience values in shared/celery_config.py
   - YAML/env override support with inline documentation
   - Links to official Celery/Redis docs in code comments

4. Enhanced Observability
   - worker_task_counts_retries_by_count - retry patterns
   - upload_task_processing_retry - processing lock retries
   - upload_task_too_many_retries - tasks hitting max retries
   - worker_task_counts_max_retries_exceeded - widespread issues

5. Generalized for All Tasks
   - Enhanced BaseCodecovTask.safe_retry() with metrics
   - Configurable max retries and exponential backoff
   - Available to all tasks for consistent retry behavior

IMPACT:
- Pod failures now recover in 15 minutes instead of 6 hours
- Retry logic works correctly after pod deaths
- All configuration centralized and documented
- New metrics enable alerting on stuck/failing uploads

NEXT STEPS (separate PRs):
- Dead Letter Queue for tasks exceeding max retries
- Prometheus alerting on new metrics

SAFETY:
- Backward compatible (no breaking changes)
- Race conditions prevented by existing Redis locks
- Rollback plan documented in PR description
…eout

1. Update test_run_impl_currently_processing_second_retry:
   - Test now verifies processing check runs on ANY retry (not just first)
   - Task should raise Retry exception when processing is ongoing
   - This properly tests the fix for the retries==0 bug

2. Update test_celery_config:
   - Update expected visibility_timeout: 21600 (6h) → 900 (15m)
   - Add comment explaining the change for pod failure recovery

Both changes align with the resilience improvements in the main PR.
Add test_run_impl_too_many_retries_logs_checkpoint to verify that
UploadFlow.TOO_MANY_RETRIES checkpoint is logged when a task hits
the max retry limit (10 retries) while processing is ongoing.

This addresses missing test coverage identified by code review for
the new retry limit behavior added in the resilience improvements.

The test verifies:
- maybe_log_upload_checkpoint is called with TOO_MANY_RETRIES
- Task returns with reason 'too_many_processing_retries'
- Task doesn't proceed with setup when max retries exceeded
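
A hedged sketch of how such a test might be shaped; fixture names, patch targets, and import paths are assumptions rather than the actual test code:

from tasks.upload import UploadTask, UploadFlow  # import paths assumed


def test_run_impl_too_many_retries_logs_checkpoint(mocker, dbsession, mock_redis):
    # Method name assumed: whatever check detects the processing lock being held
    mocker.patch.object(UploadTask, "is_currently_processing", return_value=True)
    checkpoint = mocker.patch("tasks.upload.maybe_log_upload_checkpoint")
    # Simulate a task that has already hit the 10-retry limit
    mocker.patch.object(UploadTask, "request", retries=10)

    result = UploadTask().run_impl(dbsession, repoid=1, commitid="abc123")

    checkpoint.assert_called_with(UploadFlow.TOO_MANY_RETRIES)
    assert result["reason"] == "too_many_processing_retries"
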
# Can be overridden via: setup.tasks.celery.reject_on_worker_lost config
# Celery docs: https://docs.celeryq.dev/en/stable/userguide/configuration.html#task-reject-on-worker-lost
task_reject_on_worker_lost = bool(
    get_config("setup", "tasks", "celery", "reject_on_worker_lost", default=True)
)
Contributor commented:

The docs do mention "Enabling this can cause message loops", so maybe it's safer to have the default be False and activate it in specific tasks?

(like the Upload task that won't run in a loop forever because the arguments in redis eventually expire)

Contributor commented:

Ah now I see the IMPORTANT note above... not all tasks have acks_late set to True, do they?

Contributor (Author) commented:

Make it clearer that there are two recovery mechanisms:
1. task_reject_on_worker_lost - immediate retry when pod dies
2. visibility_timeout (15 min) - safety net for lost/unacknowledged tasks

Previous wording was ambiguous about when 15 minutes applied.
The file contained detailed documentation on upload task resilience improvements, which have been fully integrated into the codebase and commit history. This cleanup helps streamline project documentation.