
Conversation

@shivansh-gohem

Problem

When block upload fails mid-operation with a transient error (e.g., S3 500 Internal Server Error):

  1. Chunks directory gets uploaded to object storage ✓
  2. Index file gets uploaded to object storage ✓
  3. meta.json upload FAILS
  4. Error is returned but local block is kept (for retry) ✓
  5. BUT: Partial block (no meta.json) remains in object storage ✗
  6. Syncer discovers it as "partial upload"
  7. After 48 hours, BestEffortCleanAbortedPartialUploads() deletes it permanently 💥
  8. Result: Hours of metrics data lost permanently

Root Cause: the upload() function in pkg/block/block.go had a comment saying "It makes sure cleanup is done on error to avoid partial block uploads", but no cleanup code actually existed.


Solution

Added defer-based cleanup logic that:

  • Tracks when the upload starts with an uploadStarted flag
  • On any upload error, immediately calls Delete() to remove the partial block from object storage
  • Uses a separate 5-minute timeout context for cleanup, so a cancelled or expired upload context cannot block the cleanup itself
  • Logs warnings/errors for operational visibility
  • Keeps the local block (allowing a retry on the next cycle)

Changes

File: pkg/block/block.go

  • Modified the upload() function signature to use a named return (err error)
  • Added uploadStarted flag tracking
  • Added a defer cleanup block (sketched below) that:
    • Checks whether an error occurred and the upload had started
    • Calls Delete() with an isolated context
    • Logs cleanup attempts for observability
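
A minimal sketch of the described change, not the actual Thanos code: the Bucket interface, the file names, and the plain log calls below are simplified stand-ins for the objstore client, block layout, and go-kit logger used in pkg/block.

```go
package block

import (
	"context"
	"fmt"
	"log"
	"time"
)

// Bucket is a simplified stand-in for the object storage client; only the
// two operations relevant to this sketch are included.
type Bucket interface {
	Upload(ctx context.Context, name string, data []byte) error
	Delete(ctx context.Context, name string) error
}

// upload writes the block's files and, on failure, removes whatever was
// already written. The named return value (err) lets the deferred cleanup
// observe the final error.
func upload(ctx context.Context, bkt Bucket, blockID string, files map[string][]byte) (err error) {
	uploadStarted := false

	defer func() {
		// Clean up only if the upload failed after it had started.
		if err == nil || !uploadStarted {
			return
		}
		// Use a fresh context with its own 5-minute timeout so a cancelled
		// or expired upload context cannot prevent the cleanup itself.
		cleanCtx, cancel := context.WithTimeout(context.Background(), 5*time.Minute)
		defer cancel()

		log.Printf("upload of block %s failed, removing partial block from object storage: %v", blockID, err)
		for name := range files {
			if delErr := bkt.Delete(cleanCtx, blockID+"/"+name); delErr != nil {
				log.Printf("cleanup of %s failed: %v", name, delErr)
			}
		}
	}()

	// Chunks and index go first, meta.json last; a failure at any point
	// after the first object triggers the deferred cleanup above.
	for _, name := range []string{"chunks/000001", "index", "meta.json"} {
		uploadStarted = true
		if err = bkt.Upload(ctx, blockID+"/"+name, files[name]); err != nil {
			return fmt.Errorf("upload %s: %w", name, err)
		}
	}
	return nil
}
```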

File: pkg/block/block_test.go

  • Updated TestUploadCleanup expectations (see the test sketch below):
    • Index upload failure: expects 0 objects (was 2) - cleanup works ✓
    • Meta.json upload failure: expects 0 objects (was 3) - cleanup works ✓
  • Removed assertions that partial files still exist (they are now cleaned up)
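
A compact test sketch illustrating the updated expectation, reusing the Bucket interface and upload() stand-ins from the sketch above. memBucket and this TestUploadCleanup body are illustrative only, not the actual pkg/block/block_test.go code.

```go
import (
	"context"
	"errors"
	"strings"
	"testing"
)

// memBucket is a tiny in-memory Bucket for the test; failOn simulates a
// transient error when a specific object (here meta.json) is uploaded.
type memBucket struct {
	objects map[string][]byte
	failOn  string
}

func (b *memBucket) Upload(_ context.Context, name string, data []byte) error {
	if strings.HasSuffix(name, b.failOn) {
		return errors.New("simulated transient error")
	}
	b.objects[name] = data
	return nil
}

func (b *memBucket) Delete(_ context.Context, name string) error {
	delete(b.objects, name)
	return nil
}

func TestUploadCleanup(t *testing.T) {
	bkt := &memBucket{objects: map[string][]byte{}, failOn: "meta.json"}
	files := map[string][]byte{"chunks/000001": {1}, "index": {2}, "meta.json": {3}}

	if err := upload(context.Background(), bkt, "01TESTBLOCK", files); err == nil {
		t.Fatal("expected upload to fail on meta.json")
	}
	// With the cleanup in place the bucket must be empty, rather than
	// holding the chunks and index objects written before the failure.
	if got := len(bkt.objects); got != 0 {
		t.Fatalf("expected 0 objects after failed upload, got %d", got)
	}
}
```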

Testing

  • TestUploadCleanup - verifies partial block cleanup on failure
  • ✅ All pkg/block tests passing
  • ✅ Binary builds successfully
  • ✅ No regressions in existing functionality

Impact

  • Before: Data loss after 48 hours when transient upload errors occur
  • After: Immediate cleanup prevents orphaned blocks; data is preserved for retry

Related Issues

Fixes #8548


Fixes thanos-io#8548. When a block upload fails after the chunks and index files have been uploaded to object storage but before meta.json completes, a partial block is left behind. The syncer later marks it as 'partial', and BestEffortCleanAbortedPartialUploads deletes it after 48 hours, causing permanent data loss.

This fix ensures immediate cleanup of partial blocks by:
- Adding an uploadStarted flag to track when the upload begins
- Using defer to clean up on error before returning
- Calling Delete() with a separate 5-minute timeout context
- Logging warnings when cleanup occurs, for observability

The local block is still retained (allowing retry) while the remote object storage copy is cleaned up immediately.

Signed-off-by: Shivansh Sahu <[email protected]>
@shivansh-gohem (Author)

Hey maintainers, all tests relevant to my changes in pkg/block pass locally, including the newly added TestUploadCleanup, verifying that the fix works as intended.

However, CI is currently failing required unit tests for the cloud storage backends (GCS, AWS S3, Azure, etc.) due to missing credentials and configuration in the CI environment. These failures are unrelated to this patch.

Additionally, the documentation check is failing due to broken external links in legacy docs that this PR does not modify.

Please advise on how to proceed with the required cloud backend tests. If needed, I am happy to help with a follow-up PR to improve test mocking or environment configuration.

Looking forward to your feedback!

@GiedriusS (Member) left a comment


The upload function should not be removing anything. We do this on purpose, because it is much harder to reason about the system when you upload, delete, upload, and so on. We have functionality that removes partially uploaded blocks. Have you seen it?

@shivansh-gohem (Author)

Thank you for pointing me to BestEffortCleanAbortedPartialUploads()! I've reviewed the existing cleanup mechanism in pkg/compact/clean.go.

I see that partial blocks (those without meta.json) are cleaned up after 48 hours (PartialUploadThresholdAge). The issue described in #8548 occurs when:

  1. The block upload fails transiently after uploading chunks and index but before uploading meta.json.
  2. The partial block remains in object storage for 48 hours.
  3. After 48 hours, BestEffortCleanAbortedPartialUploads() deletes it permanently.
  4. This causes data loss, since the block cannot be recovered.
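
Roughly, the age-based decision works like the sketch below. This is a simplification, not the actual pkg/compact/clean.go code, and the names partialUploadThresholdAge and isAbortedPartialUpload are illustrative: the block ULID encodes its creation time, so a block still missing meta.json after the threshold is treated as aborted.

```go
import (
	"time"

	"github.com/oklog/ulid"
)

// partialUploadThresholdAge mirrors the 48-hour window discussed above.
const partialUploadThresholdAge = 48 * time.Hour

// isAbortedPartialUpload reports whether a partial block (no meta.json)
// is old enough to be considered aborted and therefore deletable.
func isAbortedPartialUpload(id ulid.ULID, now time.Time) bool {
	createdAt := ulid.Time(id.Time()) // the ULID embeds the creation timestamp
	return now.Sub(createdAt) > partialUploadThresholdAge
}
```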

I understand now that adding cleanup logic to upload() violates the separation of concerns. However, I'm not sure what the correct fix should be. Could you help clarify:

  1. Should partial uploads be retried immediately? Is the expectation that transient failures trigger an immediate retry within the same compaction cycle?
  2. Should the 48-hour threshold be adjusted? Or is there a way to distinguish "truly aborted" uploads from "temporarily failed but will retry" uploads?
  3. Is the root cause that uploads aren't being retried properly? Should I focus on improving retry logic instead of cleanup?

I want to make sure I address the data loss issue while respecting Thanos's design principles. What approach would you recommend?

Successfully merging this pull request may close these issues.

thanos compactor removing metrics if upload failed with error "upload s3 object: We encountered an internal error, please try again"
