
Conversation

@shivansh-gohem

Problem

When block upload fails mid-operation with a transient error (e.g., S3 500 Internal Server Error):

  1. Chunks directory gets uploaded to object storage ✓
  2. Index file gets uploaded to object storage ✓
  3. meta.json upload FAILS
  4. Error is returned but local block is kept (for retry) ✓
  5. BUT: Partial block (no meta.json) remains in object storage ✗
  6. Syncer discovers it as "partial upload"
  7. After 48 hours, BestEffortCleanAbortedPartialUploads() deletes it permanently 💥
  8. Result: Hours of metrics data lost permanently

Root Cause: the upload() function in pkg/block/block.go had a comment saying "It makes sure cleanup is done on error to avoid partial block uploads", but no cleanup code actually existed.


Solution

Added defer-based cleanup logic that:

  • Tracks when the upload starts with an uploadStarted flag
  • On any upload error, immediately calls Delete() to remove the partial block from object storage
  • Uses a separate 5-minute timeout context for cleanup, so a cancelled or expired upload context cannot block the cleanup itself
  • Logs warnings/errors for operational visibility
  • Keeps the local block (allowing a retry on the next cycle)

Changes

File: pkg/block/block.go

  • Modified the upload() function signature to use a named return (err error)
  • Added uploadStarted flag tracking
  • Added a defer cleanup block (sketched below) that:
    • Checks whether an error occurred and the upload had started
    • Calls Delete() with an isolated context
    • Logs cleanup attempts for observability
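
A minimal sketch of the described change, not the actual Thanos code: the Bucket interface, the file names, and the plain log calls below are simplified stand-ins for the objstore client, block layout, and go-kit logger used in pkg/block.

```go
package block

import (
	"context"
	"fmt"
	"log"
	"time"
)

// Bucket is a simplified stand-in for the object storage client; only the
// two operations relevant to this sketch are included.
type Bucket interface {
	Upload(ctx context.Context, name string, data []byte) error
	Delete(ctx context.Context, name string) error
}

// upload writes the block's files and, on failure, removes whatever was
// already written. The named return value (err) lets the deferred cleanup
// observe the final error.
func upload(ctx context.Context, bkt Bucket, blockID string, files map[string][]byte) (err error) {
	uploadStarted := false

	defer func() {
		// Clean up only if the upload failed after it had started.
		if err == nil || !uploadStarted {
			return
		}
		// Use a fresh context with its own 5-minute timeout so a cancelled
		// or expired upload context cannot prevent the cleanup itself.
		cleanCtx, cancel := context.WithTimeout(context.Background(), 5*time.Minute)
		defer cancel()

		log.Printf("upload of block %s failed, removing partial block from object storage: %v", blockID, err)
		for name := range files {
			if delErr := bkt.Delete(cleanCtx, blockID+"/"+name); delErr != nil {
				log.Printf("cleanup of %s failed: %v", name, delErr)
			}
		}
	}()

	// Chunks and index go first, meta.json last; a failure at any point
	// after the first object triggers the deferred cleanup above.
	for _, name := range []string{"chunks/000001", "index", "meta.json"} {
		uploadStarted = true
		if err = bkt.Upload(ctx, blockID+"/"+name, files[name]); err != nil {
			return fmt.Errorf("upload %s: %w", name, err)
		}
	}
	return nil
}
```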

File: pkg/block/block_test.go

  • Updated TestUploadCleanup expectations (see the test sketch below):
    • Index upload failure: expects 0 objects (was 2) - cleanup works ✓
    • Meta.json upload failure: expects 0 objects (was 3) - cleanup works ✓
  • Removed assertions that partial files still exist (they are now cleaned up)
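
A compact test sketch illustrating the updated expectation, reusing the Bucket interface and upload() stand-ins from the sketch above. memBucket and this TestUploadCleanup body are illustrative only, not the actual pkg/block/block_test.go code.

```go
import (
	"context"
	"errors"
	"strings"
	"testing"
)

// memBucket is a tiny in-memory Bucket for the test; failOn simulates a
// transient error when a specific object (here meta.json) is uploaded.
type memBucket struct {
	objects map[string][]byte
	failOn  string
}

func (b *memBucket) Upload(_ context.Context, name string, data []byte) error {
	if strings.HasSuffix(name, b.failOn) {
		return errors.New("simulated transient error")
	}
	b.objects[name] = data
	return nil
}

func (b *memBucket) Delete(_ context.Context, name string) error {
	delete(b.objects, name)
	return nil
}

func TestUploadCleanup(t *testing.T) {
	bkt := &memBucket{objects: map[string][]byte{}, failOn: "meta.json"}
	files := map[string][]byte{"chunks/000001": {1}, "index": {2}, "meta.json": {3}}

	if err := upload(context.Background(), bkt, "01TESTBLOCK", files); err == nil {
		t.Fatal("expected upload to fail on meta.json")
	}
	// With the cleanup in place the bucket must be empty, rather than
	// holding the chunks and index objects written before the failure.
	if got := len(bkt.objects); got != 0 {
		t.Fatalf("expected 0 objects after failed upload, got %d", got)
	}
}
```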

Testing

  • TestUploadCleanup - verifies partial block cleanup on failure
  • ✅ All pkg/block tests passing
  • ✅ Binary builds successfully
  • ✅ No regressions in existing functionality

Impact

  • Before: Data loss after 48 hours when transient upload errors occur
  • After: Immediate cleanup prevents orphaned blocks; data is preserved for retry

Related Issues

Fixes #8548


Fixes thanos-io#8548. When a block upload fails after the chunks and index files have been uploaded to object storage but before meta.json completes, a partial block is left behind. The syncer later marks it as 'partial', and BestEffortCleanAbortedPartialUploads deletes it after 48 hours, causing permanent data loss.

This fix ensures immediate cleanup of partial blocks by:
- Adding an uploadStarted flag to track when the upload begins
- Using defer to clean up on error before returning
- Calling Delete() with a separate 5-minute timeout context
- Logging warnings when cleanup occurs, for observability

The local block is still retained (allowing retry) while the remote object storage copy is cleaned up immediately.

Signed-off-by: Shivansh Sahu <[email protected]>
@shivansh-gohem (Author)

Hey maintainers, all tests relevant to my changes in pkg/block pass locally, including the newly added TestUploadCleanup, verifying that the fix works as intended.

However, CI is currently failing required unit tests for the cloud storage backends (GCS, AWS S3, Azure, etc.) due to missing credentials and configuration in the CI environment. These failures are unrelated to this patch.

Additionally, the documentation check is failing due to broken external links in legacy docs that this PR does not modify.

Please advise on how to proceed with the required cloud backend tests. If needed, I am happy to help with a follow-up PR to improve test mocking or environment configuration.

Looking forward to your feedback!

@GiedriusS (Member) left a comment


The upload function should not be removing anything. We do this on purpose, because it is much harder to reason about the system when you upload, delete, upload, and so on. We have functionality that removes partially uploaded blocks. Have you seen it?

@shivansh-gohem (Author)

Thank you for pointing me to BestEffortCleanAbortedPartialUploads()! I've reviewed the existing cleanup mechanism in pkg/compact/clean.go.

I see that partial blocks (those without meta.json) are cleaned up after 48 hours (PartialUploadThresholdAge). The issue described in #8548 occurs when:

  1. The block upload fails transiently after uploading chunks and index but before uploading meta.json.
  2. The partial block remains in object storage for 48 hours.
  3. After 48 hours, BestEffortCleanAbortedPartialUploads() deletes it permanently.
  4. This causes data loss, since the block cannot be recovered.
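
Roughly, the age-based decision works like the sketch below. This is a simplification, not the actual pkg/compact/clean.go code, and the names partialUploadThresholdAge and isAbortedPartialUpload are illustrative: the block ULID encodes its creation time, so a block still missing meta.json after the threshold is treated as aborted.

```go
import (
	"time"

	"github.com/oklog/ulid"
)

// partialUploadThresholdAge mirrors the 48-hour window discussed above.
const partialUploadThresholdAge = 48 * time.Hour

// isAbortedPartialUpload reports whether a partial block (no meta.json)
// is old enough to be considered aborted and therefore deletable.
func isAbortedPartialUpload(id ulid.ULID, now time.Time) bool {
	createdAt := ulid.Time(id.Time()) // the ULID embeds the creation timestamp
	return now.Sub(createdAt) > partialUploadThresholdAge
}
```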

I understand now that adding cleanup logic to upload() violates the separation of concerns. However, I'm not sure what the correct fix should be. Could you help clarify:

  1. Should partial uploads be retried immediately? Is the expectation that transient failures trigger an immediate retry within the same compaction cycle?
  2. Should the 48-hour threshold be adjusted? Or is there a way to distinguish "truly aborted" uploads from "temporarily failed but will retry" uploads?
  3. Is the root cause that uploads aren't being retried properly? Should I focus on improving retry logic instead of cleanup?

I want to make sure I address the data loss issue while respecting Thanos's design principles. What approach would you recommend?

Successfully merging this pull request may close these issues.

thanos compactor removing metrics if upload failed with error "upload s3 object: We encountered an internal error, please try again"
