
Conversation


@sjs004 sjs004 commented Dec 8, 2025

Description

Optimizes remote store refresh performance by avoiding unnecessary metadata uploads when the index state is unchanged.

Related Issues

Resolves #17530

Check List

  • Functionality includes testing.
  • API changes companion pull request created, if applicable.
  • Public documentation issue/PR created, if applicable.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

Summary by CodeRabbit

  • Changed
    • Metadata uploads during remote-store refreshes are now skipped when no new segments require uploading, avoiding redundant transfers.
    • The system now tracks last-uploaded state to detect unchanged conditions and prevent repeated uploads.
    • Deduplicated refresh behavior reduces logging noise and unnecessary work, improving efficiency and resource usage.



coderabbitai bot commented Dec 8, 2025

Walkthrough

Caches the last-uploaded metadata identifiers (primary term, segmentInfos generation, checkpoint version) in RemoteStoreRefreshListener, initialized to -1, and skips the remote metadata upload during refresh when the current values match the cached ones. Also precomputes the translog file generation, reuses it for the upload call, and updates the caches after a successful remote metadata upload.
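As a rough sketch of the mechanism the walkthrough describes (the field names follow the summary; the Uploader interface is a hypothetical stand-in for remoteDirectory.uploadMetadata, not the actual OpenSearch API):

```java
import java.util.concurrent.atomic.AtomicInteger;

public class DedupGuardSketch {
    // Sentinel -1 means "nothing uploaded yet", per the walkthrough.
    private volatile long lastUploadedPrimaryTerm = -1L;
    private volatile long lastUploadedGeneration = -1L;
    private volatile long lastUploadedTranslogGeneration = -1L;

    // Hypothetical stand-in for remoteDirectory.uploadMetadata(...).
    interface Uploader { void uploadMetadata(long term, long gen, long translogGen); }

    /** Returns true if an upload was performed, false if it was deduplicated. */
    public boolean maybeUpload(long term, long gen, long translogGen, Uploader uploader) {
        if (term == lastUploadedPrimaryTerm
            && gen == lastUploadedGeneration
            && translogGen == lastUploadedTranslogGeneration) {
            return false; // unchanged state: skip the metadata upload
        }
        uploader.uploadMetadata(term, gen, translogGen);
        // Caches are updated only after the upload returned without throwing.
        lastUploadedPrimaryTerm = term;
        lastUploadedGeneration = gen;
        lastUploadedTranslogGeneration = translogGen;
        return true;
    }

    public static void main(String[] args) {
        DedupGuardSketch guard = new DedupGuardSketch();
        AtomicInteger uploads = new AtomicInteger();
        Uploader counting = (t, g, tg) -> uploads.incrementAndGet();
        guard.maybeUpload(1L, 5L, 10L, counting); // first refresh: uploads
        guard.maybeUpload(1L, 5L, 10L, counting); // unchanged: skipped
        guard.maybeUpload(1L, 6L, 10L, counting); // new generation: uploads
        System.out.println(uploads.get()); // prints 2
    }
}
```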

Changes

  • Changelog (CHANGELOG.md): Added entry: metadata upload is skipped for remote store clusters when no new segments require uploading during refresh.
  • Remote store listener (server/src/main/java/org/opensearch/index/shard/RemoteStoreRefreshListener.java): Adds volatile cached fields lastUploadedPrimaryTerm, lastUploadedLuceneGeneration, lastUploadedCheckpointVersion (init -1L); fetches and null-checks the translog generation early and computes translogFileGeneration; short-circuits the upload when the current primary term, Lucene generation, and checkpoint version equal the cached values; reuses the precomputed translogFileGeneration in the upload call; updates the caches after a successful remoteDirectory.uploadMetadata.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

  • Inspect thread-safety and visibility semantics of the three volatile fields.
  • Verify correctness and atomicity of the deduplication condition to avoid skipping necessary uploads.
  • Confirm translog generation retrieval, null-check, and computed translogFileGeneration are correct and used consistently.
  • Ensure caches are only updated after a confirmed successful remoteDirectory.uploadMetadata and that error paths don't corrupt state.

Poem

🐇 I watched my terms and gens beneath the moon,
I counted checkpoints softly, hummed a tune,
If all aligned, I kept my packet tight,
No clumsy upload broke the quiet night,
🥕 A nimble hop, a cached and tidy byte.

Pre-merge checks and finishing touches

❌ Failed checks (1 warning, 1 inconclusive)
  • Docstring Coverage: ⚠️ Warning. Docstring coverage is 8.33%, which is insufficient; the required threshold is 80.00%. Resolution: you can run @coderabbitai generate docstrings to improve docstring coverage.
  • Description check: ❓ Inconclusive. The PR description is minimal and somewhat vague. While it mentions the optimization objective, it lacks specific details about the implementation approach and current state. Resolution: expand the description to explain (1) the deduplication mechanism using the lastUploaded fields, (2) how it avoids unnecessary metadata uploads, (3) current test failures and their status, and (4) whether this PR is ready for merge or still in development.
✅ Passed checks (1 passed)
  • Title check: ✅ Passed. The PR title accurately describes the main change: optimizing remote store refresh by skipping unnecessary metadata uploads when the index state is unchanged, which aligns with the CHANGELOG entry and code modifications.
✨ Finishing touches
  • 📝 Generate docstrings
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment



@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🧹 Nitpick comments (2)
server/src/test/java/org/opensearch/index/shard/RemoteStoreRefreshListenerTests.java (2)

404-405: Relaxed successLatch counts align with reduced engine calls, but only assert “at least once”

Lowering the latch from 3 to 1 in the success‑on‑first/second/third‑attempt tests matches the expectation that IndexShard.getEngine() is now invoked fewer times in the success path. Note that with CountDownLatch this still only enforces “called at least once”, not “exactly once”; if you ever need to pin down the exact call count, a mock/spy verification or an AtomicLong assertion would be more precise.

Also applies to: 426-427, 499-500
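To illustrate the "at least once" vs. "exactly once" point above, a plain counter pinned with an exact assertion catches both missing and extra invocations, which a lowered CountDownLatch cannot. EngineSupplier is a hypothetical stand-in for IndexShard.getEngine(), not the real API:

```java
import java.util.concurrent.atomic.AtomicLong;

public class ExactCallCountSketch {
    // Hypothetical stand-in for the engine accessor the tests exercise.
    interface EngineSupplier { Object getEngine(); }

    public static void main(String[] args) {
        AtomicLong calls = new AtomicLong();
        EngineSupplier counting = () -> { calls.incrementAndGet(); return new Object(); };

        // The code under test would call getEngine() here; simulate one invocation.
        counting.getEngine();

        // A CountDownLatch lowered from 3 to 1 only proves "called at least once";
        // an exact equality check also fails on extra invocations.
        if (calls.get() != 1L) {
            throw new AssertionError("expected exactly one getEngine() call, got " + calls.get());
        }
        System.out.println(calls.get()); // prints 1
    }
}
```

A Mockito verify with an explicit call count would serve the same purpose where mocks are already in play.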


875-888: verifyUploadedSegments now tolerates missing metadata entries

Allowing files to be absent from uploadedSegments and only validating metadata when present makes the helper compatible with runs where metadata upload may have been legitimately skipped. However, this also weakens the check: it no longer guarantees that every non‑excluded local file has corresponding remote metadata, only that any present entries are sane and that a local segments_N file exists.

If the intent is to keep strong guarantees for cases where uploads are expected (e.g., the various testAfter* flows), consider optionally asserting that at least one non‑excluded file appears in uploadedSegments or adding a dedicated assertion in those tests while leaving this helper looser for optimization cases.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 99695fe and 6b48ac9.

📒 Files selected for processing (3)
  • CHANGELOG.md (1 hunks)
  • server/src/main/java/org/opensearch/index/shard/RemoteStoreRefreshListener.java (1 hunks)
  • server/src/test/java/org/opensearch/index/shard/RemoteStoreRefreshListenerTests.java (4 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (210)
  • GitHub Check: gradle-check
  • GitHub Check: Mend Security Check
  • GitHub Check: detect-breaking-change
  • GitHub Check: precommit (25, macos-15)
  • GitHub Check: precommit (25, ubuntu-latest)
  • GitHub Check: precommit (25, macos-15-intel)
  • GitHub Check: precommit (21, windows-2025, true)
  • GitHub Check: precommit (21, ubuntu-latest)
  • GitHub Check: precommit (25, windows-latest)
  • GitHub Check: precommit (21, macos-15)
  • GitHub Check: precommit (21, windows-latest)
  • GitHub Check: Analyze (java)
  • GitHub Check: precommit (21, ubuntu-24.04-arm)
  • GitHub Check: precommit (25, ubuntu-24.04-arm)
  • GitHub Check: assemble (25, ubuntu-latest)
  • GitHub Check: assemble (25, windows-latest)
  • GitHub Check: precommit (21, macos-15-intel)
  • GitHub Check: assemble (21, windows-latest)
  • GitHub Check: assemble (25, ubuntu-24.04-arm)
  • GitHub Check: assemble (21, ubuntu-latest)
  • GitHub Check: assemble (21, ubuntu-24.04-arm)
🔇 Additional comments (3)
CHANGELOG.md (1)

27-27: Changelog entry accurately reflects behavior change

Description and PR link look correct and match the implementation in RemoteStoreRefreshListener. No further changes needed.

server/src/main/java/org/opensearch/index/shard/RemoteStoreRefreshListener.java (1)

270-276: Metadata upload is now correctly gated on “any new segments”

Reusing skipUpload to decide whether all post‑refresh files are already present remotely is consistent with the existing upload filter and isRemoteSegmentStoreInSync(). Given containsFile is keyed off remote metadata, this preserves the invariant that metadata is uploaded whenever new segments exist, and skipped only when the remote store is already in sync. Looks good.

server/src/test/java/org/opensearch/index/shard/RemoteStoreRefreshListenerTests.java (1)

908-927: testUploadsWhenNewSegmentsPresent correctly guards against regressions in segment uploads

This integration test validates that adding a new document leads to additional uploaded segments, using assertBusy to accommodate async behavior. It’s a good complement to the optimization: ensures we didn’t inadvertently suppress uploads when new segments are present.

Comment on lines +890 to +907
public void testSkipsMetadataUploadWhenNoNewSegments() throws IOException {
setup(true, 2);

indexShard.refresh("initial-upload");

try (Store remoteStore = indexShard.remoteStore()) {
RemoteSegmentStoreDirectory remoteSegmentStoreDirectory =
(RemoteSegmentStoreDirectory) ((FilterDirectory) ((FilterDirectory) remoteStore.directory()).getDelegate()).getDelegate();

final int initialCount = remoteSegmentStoreDirectory.getSegmentsUploadedToRemoteStore().size();

indexShard.refresh("test-optimization");

int afterOptimizationCount = remoteSegmentStoreDirectory.getSegmentsUploadedToRemoteStore().size();
assertEquals("No new segments should be uploaded when optimization kicks in", initialCount, afterOptimizationCount);
}
}


⚠️ Potential issue | 🟡 Minor

testSkipsMetadataUploadWhenNoNewSegments doesn’t currently prove metadata uploads are skipped

This test asserts that the number of entries in getSegmentsUploadedToRemoteStore() is unchanged across two refreshes with no new docs. That condition should already hold even without the new optimization, since no additional segment blobs are uploaded when there are no new segments.

If you want this test to specifically guard the “skip metadata upload” behavior, you may want to instrument or spy on the remote directory (e.g., verify uploadMetadata is not called on the second refresh) instead of or in addition to checking the segment count.

🤖 Prompt for AI Agents
In
server/src/test/java/org/opensearch/index/shard/RemoteStoreRefreshListenerTests.java
around lines 890 to 907, the test only checks segment count which does not prove
metadata upload was skipped; replace or augment the test to instrument/spy the
remote directory and assert that its metadata-upload method is not invoked on
the second refresh. Specifically, obtain a spy or mock of the
RemoteSegmentStoreDirectory (or the remote directory wrapper), run the first
refresh and verify uploadMetadata was called once, run the second refresh and
verify uploadMetadata was NOT called (or that its invocation count did not
increase); use your test framework’s mocking utilities (e.g., Mockito
spy/verify) to assert the expected call counts and keep the existing
segment-count assertion if desired.
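The spy approach suggested above could be sketched with a counting decorator. RemoteDir is a hypothetical interface standing in for RemoteSegmentStoreDirectory; in the real test a Mockito spy on the directory would serve the same purpose:

```java
import java.util.concurrent.atomic.AtomicInteger;

public class CountingUploadSketch {
    // Hypothetical stand-in for the remote directory's metadata-upload entry point.
    interface RemoteDir { void uploadMetadata(); }

    // Decorator that counts uploadMetadata invocations so a test can assert
    // the count does not increase across a refresh with no new segments.
    static class CountingRemoteDir implements RemoteDir {
        final RemoteDir delegate;
        final AtomicInteger uploadMetadataCalls = new AtomicInteger();

        CountingRemoteDir(RemoteDir delegate) { this.delegate = delegate; }

        @Override
        public void uploadMetadata() {
            uploadMetadataCalls.incrementAndGet();
            delegate.uploadMetadata();
        }
    }

    public static void main(String[] args) {
        CountingRemoteDir dir = new CountingRemoteDir(() -> {});
        dir.uploadMetadata(); // first refresh: metadata uploaded
        int afterFirst = dir.uploadMetadataCalls.get();
        // A second refresh with no new segments should leave the count unchanged.
        int afterSecond = dir.uploadMetadataCalls.get();
        if (afterFirst != 1 || afterSecond != afterFirst) {
            throw new AssertionError("uploadMetadata call count changed unexpectedly");
        }
        System.out.println(afterSecond); // prints 1
    }
}
```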


github-actions bot commented Dec 9, 2025

❌ Gradle check result for 7d7e4f1: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?


sjs004 commented Dec 9, 2025

@sandeshkr419

I tried to fix the test failures for this PR, but I believe the changes may be incomplete. After this change, the snapshot state comes back as PARTIAL instead of SUCCESS. I believe it should be SUCCESS even if the metadata upload is skipped in some scenarios; PARTIAL is not the right terminal state. I tried bypassing this check to surface more issues and hit a further assertion failure because snapshotInfo.failedShards() was greater than zero. So the handling of SnapshotInfo in this scenario should be changed, IMO. I will continue to dig deeper.

REPRODUCE WITH: ./gradlew ':server:test' --tests 'org.opensearch.repositories.blobstore.BlobStoreRepositoryRemoteIndexTests.testRetrieveShallowCopySnapshotCase2' -Dtests.seed=127D06CE3CE0581A -Dtests.security.manager=true -Dtests.jvm.argline="-XX:TieredStopAtLevel=1 -XX:ReservedCodeCacheSize=64m" -Dtests.locale=sr-Latn -Dtests.timezone=America/Anguilla -Druntime.java=24

BlobStoreRepositoryRemoteIndexTests > testRetrieveShallowCopySnapshotCase2 FAILED
    java.lang.AssertionError: 
    Expected: is <SUCCESS>
         but: was <PARTIAL>
        at __randomizedtesting.SeedInfo.seed([127D06CE3CE0581A:5B161C4239EDE0D3]:0)
        at org.hamcrest.MatcherAssert.assertThat(MatcherAssert.java:18)
        at org.hamcrest.MatcherAssert.assertThat(MatcherAssert.java:6)
        at org.apache.lucene.tests.util.LuceneTestCase.assertThat(LuceneTestCase.java:2104)
        at org.opensearch.repositories.blobstore.BlobStoreRepositoryHelperTests.createSnapshot(BlobStoreRepositoryHelperTests.java:141)
        at org.opensearch.repositories.blobstore.BlobStoreRepositoryRemoteIndexTests.testRetrieveShallowCopySnapshotCase2(BlobStoreRepositoryRemoteIndexTests.java:309)


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🧹 Nitpick comments (1)
server/src/main/java/org/opensearch/index/shard/RemoteStoreRefreshListener.java (1)

94-96: Consider initializing cache fields from remote metadata.

The cache fields are not initialized from remoteSegmentMetadata in the constructor (line 122), which could lead to a redundant metadata upload on the first refresh even when identical metadata already exists in the remote store. While this is not incorrect (the deduplication check will eventually work after the first upload), initializing these fields from remoteSegmentMetadata (similar to how primaryTerm is initialized) would optimize the initial state.

Apply this diff to initialize the cache from remote metadata:

 // initializing primary term with the primary term of latest metadata in remote store.
 // if no metadata is present, this value will be initialized with -1.
 this.primaryTerm = remoteSegmentMetadata != null ? remoteSegmentMetadata.getPrimaryTerm() : INVALID_PRIMARY_TERM;
+if (remoteSegmentMetadata != null) {
+    this.lastUploadedPrimaryTerm = remoteSegmentMetadata.getPrimaryTerm();
+    this.lastUploadedGeneration = remoteSegmentMetadata.getGeneration();
+    this.lastUploadedTranslogGeneration = remoteSegmentMetadata.getTranslogGeneration();
+}
 this.segmentTracker = segmentTracker;
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 1f2730d and ab9354c.

📒 Files selected for processing (1)
  • server/src/main/java/org/opensearch/index/shard/RemoteStoreRefreshListener.java (3 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (20)
  • GitHub Check: precommit (25, ubuntu-24.04-arm)
  • GitHub Check: precommit (21, windows-2025, true)
  • GitHub Check: precommit (25, macos-15-intel)
  • GitHub Check: precommit (21, ubuntu-latest)
  • GitHub Check: precommit (25, windows-latest)
  • GitHub Check: precommit (25, macos-15)
  • GitHub Check: precommit (21, macos-15)
  • GitHub Check: precommit (21, macos-15-intel)
  • GitHub Check: precommit (21, ubuntu-24.04-arm)
  • GitHub Check: precommit (21, windows-latest)
  • GitHub Check: precommit (25, ubuntu-latest)
  • GitHub Check: assemble (21, ubuntu-24.04-arm)
  • GitHub Check: assemble (25, ubuntu-24.04-arm)
  • GitHub Check: assemble (21, ubuntu-latest)
  • GitHub Check: assemble (25, windows-latest)
  • GitHub Check: detect-breaking-change
  • GitHub Check: Analyze (java)
  • GitHub Check: assemble (25, ubuntu-latest)
  • GitHub Check: assemble (21, windows-latest)
  • GitHub Check: check-files
🔇 Additional comments (1)
server/src/main/java/org/opensearch/index/shard/RemoteStoreRefreshListener.java (1)

443-469: Deduplication logic is sound, but undermined by line 273.

The deduplication check here correctly compares primary term, SegmentInfos generation, and translog generation to determine whether metadata has changed. This logic properly handles:

  • Primary term changes (will upload)
  • Segment changes via generation (will upload)
  • Translog advancement via translog generation (will upload)

The cache update after successful upload (lines 467-469) correctly tracks the last uploaded state.

However, this well-designed deduplication logic is currently being undermined by the check on line 273, which bypasses the call to uploadMetadata entirely when all segment files are unchanged, preventing this method from evaluating translog generation changes.

Note: Once the critical issue on line 273 is resolved, this deduplication logic will function correctly and provide the intended optimization.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🧹 Nitpick comments (2)
server/src/main/java/org/opensearch/index/shard/RemoteStoreRefreshListener.java (2)

94-96: Consider resetting cache fields on primary term change.

While the deduplication check at lines 445-447 includes lastUploadedPrimaryTerm, which correctly handles primary term transitions, explicitly resetting these cache fields in initializeRemoteDirectoryOnTermUpdate (around line 549) would make the invalidation logic more transparent and resilient to future modifications.

Apply this diff to reset the cache on primary term change:

     private void initializeRemoteDirectoryOnTermUpdate() throws IOException {
         if (this.primaryTerm != indexShard.getOperationPrimaryTerm()) {
             logger.trace("primaryTerm update from={} to={}", primaryTerm, indexShard.getOperationPrimaryTerm());
             this.primaryTerm = indexShard.getOperationPrimaryTerm();
+            // Reset cache on primary term change
+            this.lastUploadedPrimaryTerm = INVALID_PRIMARY_TERM;
+            this.lastUploadedGeneration = -1;
+            this.lastUploadedTranslogGeneration = -1;
             RemoteSegmentMetadata uploadedMetadata = this.remoteDirectory.init();

440-444: Minor: Translog generation retrieved before deduplication check.

Lines 440-444 retrieve and validate the translog generation before the deduplication check at lines 445-451. If deduplication skips the upload (returns early at line 450), this retrieval work is wasted.

For a minor optimization, consider moving the translog generation retrieval after the deduplication check, but this requires restructuring the cache to store whether translog generation has been retrieved. Given the complexity and minimal performance gain, this is a low-priority refinement.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between ab9354c and 9f5abc5.

📒 Files selected for processing (1)
  • server/src/main/java/org/opensearch/index/shard/RemoteStoreRefreshListener.java (3 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (19)
  • GitHub Check: gradle-check
  • GitHub Check: precommit (25, macos-15)
  • GitHub Check: precommit (25, windows-latest)
  • GitHub Check: Analyze (java)
  • GitHub Check: precommit (25, macos-15-intel)
  • GitHub Check: precommit (25, ubuntu-24.04-arm)
  • GitHub Check: precommit (21, windows-2025, true)
  • GitHub Check: precommit (21, macos-15-intel)
  • GitHub Check: precommit (21, windows-latest)
  • GitHub Check: precommit (21, ubuntu-24.04-arm)
  • GitHub Check: precommit (21, macos-15)
  • GitHub Check: precommit (25, ubuntu-latest)
  • GitHub Check: precommit (21, ubuntu-latest)
  • GitHub Check: assemble (25, ubuntu-24.04-arm)
  • GitHub Check: assemble (25, windows-latest)
  • GitHub Check: assemble (21, ubuntu-24.04-arm)
  • GitHub Check: assemble (25, ubuntu-latest)
  • GitHub Check: assemble (21, windows-latest)
  • GitHub Check: assemble (21, ubuntu-latest)
🔇 Additional comments (1)
server/src/main/java/org/opensearch/index/shard/RemoteStoreRefreshListener.java (1)

445-451: Critical: Verify snapshot compatibility with deduplication logic.

The review raises concerns that the deduplication check (lines 445-451) may break snapshot consistency, evidenced by test failure showing snapshot state becoming PARTIAL instead of SUCCESS. Before merging:

  1. Confirm the test failure: Verify that BlobStoreRepositoryRemoteIndexTests.testRetrieveShallowCopySnapshotCase2 reproduces the PARTIAL snapshot state with this deduplication logic.

  2. Validate state tracking completeness: Confirm that primary term, segment generation, and translog generation are sufficient to deduplicate metadata uploads, and that maxSeqNo cannot advance independently without triggering one of these values.

  3. Understand snapshot protocol: Investigate whether snapshot operations depend on metadata uploads as synchronization points or coordination signals.


github-actions bot commented Dec 9, 2025

❌ Gradle check result for 9f5abc5: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

♻️ Duplicate comments (1)
server/src/main/java/org/opensearch/index/shard/RemoteStoreRefreshListener.java (1)

275-281: Delete the commented-out legacy metadata-upload branch

This entire if (localSegmentsPostRefresh.stream().allMatch(...)) block is now preserved only as comments and duplicates the logic that used to sit here. Version control already captures that history, so keeping it commented just adds noise around the core happy-path (always calling uploadMetadata). Since a prior review already requested this cleanup, it’s better to delete it outright.

-                                // if (localSegmentsPostRefresh.stream().allMatch(file -> skipUpload(file))) {
-                                // logger.debug("Skipping metadata upload - no new segments were uploaded");
-                                // } else {
-                                // // Start metadata file upload
-                                // uploadMetadata(localSegmentsPostRefresh, segmentInfos, checkpoint);
-                                // logger.debug("Metadata upload successful");
-                                // }
🧹 Nitpick comments (2)
server/src/main/java/org/opensearch/index/shard/RemoteStoreRefreshListener.java (2)

94-96: Clarify sentinel initialization comment or align constants for all cached fields

The inline comment // Use constant or -1 on lastUploadedPrimaryTerm is ambiguous now that INVALID_PRIMARY_TERM is already in use, while the generation/translog fields still use raw -1. Either remove the comment or make the intent explicit (e.g., introduce analogous constants for “no generation/no translog generation” and reference them here) so future readers aren’t left wondering if this is a TODO.


439-472: Validate deduplication key invariants and reconsider exception type for null translog generation

The new dedup guard in uploadMetadata looks like the right place to centralize “skip if nothing has changed” logic, and keying it off (primaryTerm, segmentInfos.getGeneration(), translogFileGeneration) is reasonable. A couple of points are worth double‑checking:

  • Dedup key completeness: Please confirm there is no scenario where the metadata content (notably the checkpoint/userData you derive via currentOngoingRefreshCheckpoint()) can change while these three identifiers remain the same. If such a scenario exists, this guard would incorrectly skip uploads and could explain situations like the PARTIAL snapshot state you’re seeing. If the invariant does hold (“metadata is entirely determined by primaryTerm + segment gen + translog generation”), consider adding a short comment documenting that assumption to prevent future regressions when uploadMetadata evolves.

  • Null TranslogGeneration handling: Throwing UnsupportedOperationException for a null TranslogGeneration reads more like “this API is not supported” than “we hit an impossible runtime state.” If remote store is only expected to run with a non‑null translog generation, IllegalStateException (or even an assertion) would better communicate that this is a broken invariant rather than an unsupported feature.

If you agree, updating the exception type and adding a brief comment on the dedup invariant would tighten this logic without changing behavior.
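The exception-type suggestion can be sketched as follows. This is plain Java, not the OpenSearch source: `TranslogGeneration` here is a simplified stand-in for the real class, and the helper name is hypothetical.

```java
// Simplified stand-in for Translog.TranslogGeneration; only the field
// this sketch needs is modeled.
class TranslogGeneration {
    final long translogFileGeneration;

    TranslogGeneration(long gen) {
        this.translogFileGeneration = gen;
    }
}

public class TranslogInvariantSketch {
    // Hypothetical helper: a null generation is a broken runtime invariant,
    // so IllegalStateException communicates that better than
    // UnsupportedOperationException ("this API is not supported").
    static long requireTranslogFileGeneration(TranslogGeneration generation) {
        if (generation == null) {
            throw new IllegalStateException(
                "Encountered null TranslogGeneration while uploading metadata to remote segment store"
            );
        }
        return generation.translogFileGeneration;
    }

    public static void main(String[] args) {
        if (requireTranslogFileGeneration(new TranslogGeneration(7L)) != 7L) {
            throw new AssertionError("expected generation 7");
        }
        try {
            requireTranslogFileGeneration(null);
            throw new AssertionError("expected IllegalStateException");
        } catch (IllegalStateException expected) {
            System.out.println("ok");
        }
    }
}
```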

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 9f5abc5 and d285df8.

📒 Files selected for processing (1)
  • server/src/main/java/org/opensearch/index/shard/RemoteStoreRefreshListener.java (3 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (20)
  • GitHub Check: detect-breaking-change
  • GitHub Check: Analyze (java)
  • GitHub Check: precommit (25, ubuntu-24.04-arm)
  • GitHub Check: precommit (21, windows-2025, true)
  • GitHub Check: precommit (25, macos-15)
  • GitHub Check: precommit (25, macos-15-intel)
  • GitHub Check: precommit (25, ubuntu-latest)
  • GitHub Check: precommit (25, windows-latest)
  • GitHub Check: precommit (21, windows-latest)
  • GitHub Check: precommit (21, ubuntu-latest)
  • GitHub Check: precommit (21, ubuntu-24.04-arm)
  • GitHub Check: precommit (21, macos-15-intel)
  • GitHub Check: precommit (21, macos-15)
  • GitHub Check: assemble (21, ubuntu-24.04-arm)
  • GitHub Check: assemble (25, ubuntu-24.04-arm)
  • GitHub Check: assemble (25, windows-latest)
  • GitHub Check: assemble (21, windows-latest)
  • GitHub Check: assemble (25, ubuntu-latest)
  • GitHub Check: assemble (21, ubuntu-latest)
  • GitHub Check: check-files

@github-actions
Contributor

github-actions bot commented Dec 9, 2025

❌ Gradle check result for d285df8: FAILURE



@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

♻️ Duplicate comments (1)
server/src/main/java/org/opensearch/index/shard/RemoteStoreRefreshListener.java (1)

276-282: Remove the commented-out metadata-skip branch.

This entire block is dead code that duplicates the now-superseded skip logic and is already captured in git history. Keeping it commented out adds noise and risks confusion if future changes diverge from it. It’s safe to delete these lines outright.

🧹 Nitpick comments (1)
server/src/main/java/org/opensearch/index/shard/RemoteStoreRefreshListener.java (1)

94-97: Clarify sentinel usage and remove the inline “Use constant or -1” note.

The new lastUploaded* fields fit the dedup logic, but the mixed use of INVALID_PRIMARY_TERM and raw -1 plus the inline “Use constant or -1” comment is a bit confusing. Consider either:

  • Introducing dedicated sentinel constants for generation/translog/maxSeqNo (similar to INVALID_PRIMARY_TERM), or
  • Adding a short field-level comment explaining that -1 represents “no prior upload” and dropping the inline TODO-style note.

This keeps the invariants obvious and removes the magic numbers.
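The first option could look like the sketch below; the constant name and class shape are illustrative, not the actual `RemoteStoreRefreshListener` code.

```java
// Illustrative sketch: one named sentinel constant instead of raw -1 literals.
public class UploadStateSketch {
    /** Sentinel meaning "no metadata upload has happened yet on this listener". */
    static final long NO_PRIOR_UPLOAD = -1L;

    private volatile long lastUploadedPrimaryTerm = NO_PRIOR_UPLOAD;
    private volatile long lastUploadedLuceneGeneration = NO_PRIOR_UPLOAD;
    private volatile long lastUploadedCheckpointVersion = NO_PRIOR_UPLOAD;

    boolean hasPriorUpload() {
        return lastUploadedPrimaryTerm != NO_PRIOR_UPLOAD;
    }

    void recordUpload(long primaryTerm, long luceneGeneration, long checkpointVersion) {
        lastUploadedPrimaryTerm = primaryTerm;
        lastUploadedLuceneGeneration = luceneGeneration;
        lastUploadedCheckpointVersion = checkpointVersion;
    }

    public static void main(String[] args) {
        UploadStateSketch state = new UploadStateSketch();
        // A fresh listener starts at the sentinel, so the first refresh always uploads.
        if (state.hasPriorUpload()) throw new AssertionError("fresh listener should report no prior upload");
        state.recordUpload(1L, 5L, 10L);
        if (!state.hasPriorUpload()) throw new AssertionError("upload should have been recorded");
        System.out.println("ok");
    }
}
```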

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between d285df8 and 02dee94.

📒 Files selected for processing (1)
  • server/src/main/java/org/opensearch/index/shard/RemoteStoreRefreshListener.java (3 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (10)
  • GitHub Check: gradle-check
  • GitHub Check: assemble (25, ubuntu-24.04-arm)
  • GitHub Check: assemble (21, windows-latest)
  • GitHub Check: assemble (25, windows-latest)
  • GitHub Check: assemble (25, ubuntu-latest)
  • GitHub Check: assemble (21, ubuntu-24.04-arm)
  • GitHub Check: assemble (21, ubuntu-latest)
  • GitHub Check: assemble (21, macos-15)
  • GitHub Check: detect-breaking-change
  • GitHub Check: Analyze (java)
🔇 Additional comments (1)
server/src/main/java/org/opensearch/index/shard/RemoteStoreRefreshListener.java (1)

440-475: Dedup guard and metadata upload flow look correct for the translog/seqNo case.

Using the tuple (primary term, Lucene generation, translog file generation, maxSeqNo) to decide whether to skip metadata aligns with the goal of still uploading when only the translog/maxSeqNo advance (no new segment files). The null check on getTranslogGeneration() plus updating the lastUploaded* cache only after a successful uploadMetadata call also look sound for retries and avoids stale-cache issues.

I don’t see functional problems in this block; it should fix the earlier “skip when only translog advanced” bug while avoiding redundant uploads.

@github-actions
Copy link
Contributor

github-actions bot commented Dec 9, 2025

❌ Gradle check result for 02dee94: FAILURE



@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

♻️ Duplicate comments (1)
server/src/main/java/org/opensearch/index/shard/RemoteStoreRefreshListener.java (1)

275-281: Remove commented-out code.

The commented-out lines should be deleted entirely. Version control maintains the history.

Apply this diff:

                             try {
                                 logger.debug("New segments upload successful");
                                 uploadMetadata(localSegmentsPostRefresh, segmentInfos, checkpoint);
                                 logger.debug("Metadata upload successful");
-                                // if (localSegmentsPostRefresh.stream().allMatch(file -> skipUpload(file))) {
-                                // logger.debug("Skipping metadata upload - no new segments were uploaded");
-                                // } else {
-                                // // Start metadata file upload
-                                // uploadMetadata(localSegmentsPostRefresh, segmentInfos, checkpoint);
-                                // logger.debug("Metadata upload successful");
-                                // }
                                 clearStaleFilesFromLocalSegmentChecksumMap(localSegmentsPostRefresh);
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 02dee94 and 3715f48.

📒 Files selected for processing (1)
  • server/src/main/java/org/opensearch/index/shard/RemoteStoreRefreshListener.java (3 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: gradle-check

Comment on lines 438 to 475

         final long maxSeqNo = ((InternalEngine) indexShard.getEngine()).currentOngoingRefreshCheckpoint();
+        Translog.TranslogGeneration translogGeneration = indexShard.getEngine().translogManager().getTranslogGeneration();
+        if (translogGeneration == null) {
+            throw new UnsupportedOperationException("Encountered null TranslogGeneration while uploading metadata to remote segment store");
+        }
+        long translogFileGeneration = translogGeneration.translogFileGeneration;
+        final long currentPrimaryTerm = replicationCheckpoint.getPrimaryTerm();
+        final long currentLuceneGeneration = segmentInfos.getGeneration();
+        final long currentCheckpointVersion = replicationCheckpoint.getSegmentInfosVersion();
+        if (this.lastUploadedPrimaryTerm == currentPrimaryTerm
+            && this.lastUploadedLuceneGeneration == currentLuceneGeneration
+            && this.lastUploadedCheckpointVersion == currentCheckpointVersion) {
+
+            // LOGIC: The index state (segment files, primary authority, and replication progress)
+            // is identical to the last state successfully uploaded. Skip the costly remote I/O.
+            return;
+        }

         SegmentInfos segmentInfosSnapshot = segmentInfos.clone();
         Map<String, String> userData = segmentInfosSnapshot.getUserData();
         userData.put(LOCAL_CHECKPOINT_KEY, String.valueOf(maxSeqNo));
         userData.put(SequenceNumbers.MAX_SEQ_NO, Long.toString(maxSeqNo));
         segmentInfosSnapshot.setUserData(userData, false);

-        Translog.TranslogGeneration translogGeneration = indexShard.getEngine().translogManager().getTranslogGeneration();
-        if (translogGeneration == null) {
-            throw new UnsupportedOperationException("Encountered null TranslogGeneration while uploading metadata to remote segment store");
-        } else {
-            long translogFileGeneration = translogGeneration.translogFileGeneration;
-            remoteDirectory.uploadMetadata(
-                localSegmentsPostRefresh,
-                segmentInfosSnapshot,
-                storeDirectory,
-                translogFileGeneration,
-                replicationCheckpoint,
-                indexShard.getNodeId()
-            );
-        }
+        remoteDirectory.uploadMetadata(
+            localSegmentsPostRefresh,
+            segmentInfosSnapshot,
+            storeDirectory,
+            translogFileGeneration,
+            replicationCheckpoint,
+            indexShard.getNodeId()
+        );
+
+        this.lastUploadedPrimaryTerm = currentPrimaryTerm;
+        this.lastUploadedLuceneGeneration = currentLuceneGeneration;
+        this.lastUploadedCheckpointVersion = currentCheckpointVersion;
     }

⚠️ Potential issue | 🔴 Critical

Critical: Deduplication check must include translog generation.

The deduplication logic (lines 448-450) checks primaryTerm, luceneGeneration, and checkpointVersion, but omits translogFileGeneration. When the translog advances without creating new segments (e.g., documents indexed but not yet flushed), the three checked values remain unchanged, causing the method to return early (line 454) and skip the metadata upload. This leaves the remote store with stale translog generation information.

This likely explains the test failure in the PR description: the snapshot expects up-to-date metadata reflecting the current translog state but instead finds stale information, resulting in a PARTIAL snapshot state instead of SUCCESS.

To fix this, add lastUploadedTranslogGeneration tracking and include it in the deduplication check:

     private volatile long lastUploadedPrimaryTerm = -1L;
     private volatile long lastUploadedLuceneGeneration = -1L;
     private volatile long lastUploadedCheckpointVersion = -1L;
+    private volatile long lastUploadedTranslogGeneration = -1L;

Then update the deduplication check:

         final long currentPrimaryTerm = replicationCheckpoint.getPrimaryTerm();
         final long currentLuceneGeneration = segmentInfos.getGeneration();
         final long currentCheckpointVersion = replicationCheckpoint.getSegmentInfosVersion();
         if (this.lastUploadedPrimaryTerm == currentPrimaryTerm
             && this.lastUploadedLuceneGeneration == currentLuceneGeneration
-            && this.lastUploadedCheckpointVersion == currentCheckpointVersion) {
+            && this.lastUploadedCheckpointVersion == currentCheckpointVersion
+            && this.lastUploadedTranslogGeneration == translogFileGeneration) {
 
             // LOGIC: The index state (segment files, primary authority, and replication progress)
             // is identical to the last state successfully uploaded. Skip the costly remote I/O.
             return;
         }

And update the cache after successful upload:

         this.lastUploadedPrimaryTerm = currentPrimaryTerm;
         this.lastUploadedLuceneGeneration = currentLuceneGeneration;
         this.lastUploadedCheckpointVersion = currentCheckpointVersion;
+        this.lastUploadedTranslogGeneration = translogFileGeneration;

Committable suggestion skipped: line range outside the PR's diff.

🤖 Prompt for AI Agents
In
server/src/main/java/org/opensearch/index/shard/RemoteStoreRefreshListener.java
around lines 438 to 475, the deduplication check omits the translog file
generation so uploads are skipped when translog advances without new segments;
add a long field lastUploadedTranslogGeneration to the class, include a
comparison of lastUploadedTranslogGeneration == translogFileGeneration in the
if-condition alongside primaryTerm, luceneGeneration and checkpointVersion, and
after a successful remoteDirectory.uploadMetadata(...) set
this.lastUploadedTranslogGeneration = translogFileGeneration (ensure the field
is initialized to a sentinel like -1L so first upload proceeds).
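Putting the suggested fix together, the guard would behave like this minimal model. It is plain Java rather than the OpenSearch classes; the field names mirror the diff above, and the method names are illustrative.

```java
// Minimal model of the four-field deduplication guard; not OpenSearch code.
public class DedupGuardModel {
    private volatile long lastUploadedPrimaryTerm = -1L;
    private volatile long lastUploadedLuceneGeneration = -1L;
    private volatile long lastUploadedCheckpointVersion = -1L;
    private volatile long lastUploadedTranslogGeneration = -1L;

    /** True when the state matches the last successful upload and remote I/O can be skipped. */
    boolean canSkipUpload(long primaryTerm, long luceneGeneration, long checkpointVersion, long translogGeneration) {
        return lastUploadedPrimaryTerm == primaryTerm
            && lastUploadedLuceneGeneration == luceneGeneration
            && lastUploadedCheckpointVersion == checkpointVersion
            && lastUploadedTranslogGeneration == translogGeneration;
    }

    /** Called only after the remote metadata upload succeeds, so a failed upload is retried. */
    void recordSuccessfulUpload(long primaryTerm, long luceneGeneration, long checkpointVersion, long translogGeneration) {
        lastUploadedPrimaryTerm = primaryTerm;
        lastUploadedLuceneGeneration = luceneGeneration;
        lastUploadedCheckpointVersion = checkpointVersion;
        lastUploadedTranslogGeneration = translogGeneration;
    }

    public static void main(String[] args) {
        DedupGuardModel guard = new DedupGuardModel();
        // The -1 sentinels guarantee the very first refresh always uploads.
        if (guard.canSkipUpload(1, 5, 10, 3)) throw new AssertionError("first upload must not be skipped");
        guard.recordSuccessfulUpload(1, 5, 10, 3);
        // Identical state: the redundant upload is skipped.
        if (!guard.canSkipUpload(1, 5, 10, 3)) throw new AssertionError("unchanged state should skip");
        // Translog advanced without new segments: must still upload.
        if (guard.canSkipUpload(1, 5, 10, 4)) throw new AssertionError("translog advance must trigger an upload");
        System.out.println("ok");
    }
}
```

Keeping the cache update strictly after the upload call is what makes retries safe: an exception from the upload leaves the cached values untouched, so the next refresh attempts the upload again.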

@github-actions
Contributor

❌ Gradle check result for 98319a3: FAILURE


@github-actions
Contributor

❌ Gradle check result for 44dab6a: FAILURE


@github-actions
Contributor

❌ Gradle check result for e2fec2b: FAILURE


@github-actions
Contributor

❕ Gradle check result for ba4c4bf: UNSTABLE

Please review all flaky tests that succeeded after retry and create an issue if one does not already exist to track the flaky failure.

@codecov

codecov bot commented Dec 10, 2025

Codecov Report

❌ Patch coverage is 88.23529% with 2 lines in your changes missing coverage. Please review.
✅ Project coverage is 73.20%. Comparing base (9f8d381) to head (ba4c4bf).
⚠️ Report is 6 commits behind head on main.

Files with missing lines Patch % Lines
...search/index/shard/RemoteStoreRefreshListener.java 88.23% 1 Missing and 1 partial ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##               main   #20192      +/-   ##
============================================
- Coverage     73.30%   73.20%   -0.11%     
+ Complexity    71804    71727      -77     
============================================
  Files          5793     5793              
  Lines        328160   328184      +24     
  Branches      47266    47268       +2     
============================================
- Hits         240559   240231     -328     
- Misses        68295    68689     +394     
+ Partials      19306    19264      -42     

☔ View full report in Codecov by Sentry.

@sjs004 sjs004 changed the title Pr 19198: Skip metadata upload when no new segments during refresh Optimizes remote store refresh performance by avoiding unnecessary metadata uploads when index state is unchanged Dec 10, 2025
@github-actions github-actions bot added enhancement Enhancement or improvement to existing feature or request good first issue Good for newcomers Storage:Remote labels Dec 10, 2025
@sjs004 sjs004 marked this pull request as draft December 10, 2025 18:24
@sjs004 sjs004 changed the title Optimizes remote store refresh performance by avoiding unnecessary metadata uploads when index state is unchanged Optimize remote store refresh performance by avoiding unnecessary metadata uploads when index state is unchanged Dec 11, 2025

Labels

enhancement Enhancement or improvement to existing feature or request good first issue Good for newcomers Storage:Remote

Projects

Status: No status

Development

Successfully merging this pull request may close these issues.

Skip metadata upload if no new segment files to upload during refresh for remote store clusters

2 participants