Optimize remote store refresh performance by avoiding unnecessary metadata uploads when index state is unchanged #20192
base: main
Conversation
Walkthrough
Caches last-uploaded metadata identifiers (primary term, segmentInfos generation, checkpoint version) in RemoteStoreRefreshListener so that redundant metadata uploads can be skipped when the index state is unchanged.
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
Actionable comments posted: 1
🧹 Nitpick comments (2)
server/src/test/java/org/opensearch/index/shard/RemoteStoreRefreshListenerTests.java (2)
404-405: Relaxed successLatch counts align with reduced engine calls, but only assert “at least once”

Lowering the latch from 3 to 1 in the success-on-first/second/third-attempt tests matches the expectation that IndexShard.getEngine() is now invoked fewer times in the success path. Note that with CountDownLatch this still only enforces “called at least once”, not “exactly once”; if you ever need to pin down the exact call count, a mock/spy verification or an AtomicLong assertion would be more precise (a sketch follows below).

Also applies to: 426-427, 499-500
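If that stricter check is ever wanted, a minimal sketch of the AtomicLong approach could look like the following. Note this is illustrative only: `spyShard` and the refresh trigger are assumed placeholders, not names taken from this test class.

```java
// Hypothetical sketch: count getEngine() calls so the test can assert "exactly once"
// rather than the "at least once" that a CountDownLatch guarantees.
AtomicLong engineCalls = new AtomicLong();
doAnswer(invocation -> {
    engineCalls.incrementAndGet();
    return invocation.callRealMethod();
}).when(spyShard).getEngine();

spyShard.refresh("exact-count-check"); // drive the success path once

assertBusy(() -> assertEquals("getEngine should be invoked exactly once", 1L, engineCalls.get()));
```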
875-888: verifyUploadedSegments now tolerates missing metadata entries

Allowing files to be absent from uploadedSegments and only validating metadata when present makes the helper compatible with runs where the metadata upload may have been legitimately skipped. However, this also weakens the check: it no longer guarantees that every non-excluded local file has corresponding remote metadata, only that any present entries are sane and that a local segments_N file exists.

If the intent is to keep strong guarantees for cases where uploads are expected (e.g., the various testAfter* flows), consider optionally asserting that at least one non-excluded file appears in uploadedSegments (see the sketch below), or adding a dedicated assertion in those tests while leaving this helper looser for the optimization cases.
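A minimal sketch of that optional stronger assertion, assuming the helper gains a boolean flag; `localFiles`, `uploadedSegments`, and `EXCLUDE_FILES` are placeholders for whatever the helper actually calls them:

```java
// Hypothetical: only enforced by tests that expect an upload to have happened.
if (expectUploads) {
    boolean anyUploaded = localFiles.stream()
        .filter(file -> EXCLUDE_FILES.contains(file) == false)
        .anyMatch(uploadedSegments::containsKey);
    assertTrue("expected at least one non-excluded local file to have remote metadata", anyUploaded);
}
```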
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (3)
- CHANGELOG.md (1 hunks)
- server/src/main/java/org/opensearch/index/shard/RemoteStoreRefreshListener.java (1 hunks)
- server/src/test/java/org/opensearch/index/shard/RemoteStoreRefreshListenerTests.java (4 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (210)
- GitHub Check: gradle-check
- GitHub Check: Mend Security Check
- GitHub Check: detect-breaking-change
- GitHub Check: precommit (25, macos-15)
- GitHub Check: precommit (25, ubuntu-latest)
- GitHub Check: precommit (25, macos-15-intel)
- GitHub Check: precommit (21, windows-2025, true)
- GitHub Check: precommit (21, ubuntu-latest)
- GitHub Check: precommit (25, windows-latest)
- GitHub Check: precommit (21, macos-15)
- GitHub Check: precommit (21, windows-latest)
- GitHub Check: Analyze (java)
- GitHub Check: precommit (21, ubuntu-24.04-arm)
- GitHub Check: precommit (25, ubuntu-24.04-arm)
- GitHub Check: assemble (25, ubuntu-latest)
- GitHub Check: assemble (25, windows-latest)
- GitHub Check: precommit (21, macos-15-intel)
- GitHub Check: assemble (21, windows-latest)
- GitHub Check: assemble (25, ubuntu-24.04-arm)
- GitHub Check: assemble (21, ubuntu-latest)
- GitHub Check: assemble (21, ubuntu-24.04-arm)
🔇 Additional comments (3)
CHANGELOG.md (1)
27-27: Changelog entry accurately reflects behavior change

Description and PR link look correct and match the implementation in RemoteStoreRefreshListener. No further changes needed.

server/src/main/java/org/opensearch/index/shard/RemoteStoreRefreshListener.java (1)

270-276: Metadata upload is now correctly gated on “any new segments”

Reusing skipUpload to decide whether all post-refresh files are already present remotely is consistent with the existing upload filter and isRemoteSegmentStoreInSync(). Given containsFile is keyed off remote metadata, this preserves the invariant that metadata is uploaded whenever new segments exist, and skipped only when the remote store is already in sync. Looks good.

server/src/test/java/org/opensearch/index/shard/RemoteStoreRefreshListenerTests.java (1)

908-927: testUploadsWhenNewSegmentsPresent correctly guards against regressions in segment uploads

This integration test validates that adding a new document leads to additional uploaded segments, using assertBusy to accommodate async behavior. It’s a good complement to the optimization: it ensures we didn’t inadvertently suppress uploads when new segments are present.
```java
public void testSkipsMetadataUploadWhenNoNewSegments() throws IOException {
    setup(true, 2);

    indexShard.refresh("initial-upload");

    try (Store remoteStore = indexShard.remoteStore()) {
        RemoteSegmentStoreDirectory remoteSegmentStoreDirectory =
            (RemoteSegmentStoreDirectory) ((FilterDirectory) ((FilterDirectory) remoteStore.directory()).getDelegate()).getDelegate();

        final int initialCount = remoteSegmentStoreDirectory.getSegmentsUploadedToRemoteStore().size();

        indexShard.refresh("test-optimization");

        int afterOptimizationCount = remoteSegmentStoreDirectory.getSegmentsUploadedToRemoteStore().size();
        assertEquals("No new segments should be uploaded when optimization kicks in", initialCount, afterOptimizationCount);
    }
}
```
testSkipsMetadataUploadWhenNoNewSegments doesn’t currently prove metadata uploads are skipped
This test asserts that the number of entries in getSegmentsUploadedToRemoteStore() is unchanged across two refreshes with no new docs. That condition should already hold even without the new optimization, since no additional segment blobs are uploaded when there are no new segments.
If you want this test to specifically guard the “skip metadata upload” behavior, you may want to instrument or spy on the remote directory (e.g., verify uploadMetadata is not called on the second refresh) instead of or in addition to checking the segment count.
🤖 Prompt for AI Agents
In
server/src/test/java/org/opensearch/index/shard/RemoteStoreRefreshListenerTests.java
around lines 890 to 907, the test only checks segment count which does not prove
metadata upload was skipped; replace or augment the test to instrument/spy the
remote directory and assert that its metadata-upload method is not invoked on
the second refresh. Specifically, obtain a spy or mock of the
RemoteSegmentStoreDirectory (or the remote directory wrapper), run the first
refresh and verify uploadMetadata was called once, run the second refresh and
verify uploadMetadata was NOT called (or that its invocation count did not
increase); use your test framework’s mocking utilities (e.g., Mockito
spy/verify) to assert the expected call counts and keep the existing
segment-count assertion if desired.
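A rough sketch of that spy-based assertion follows. How the spied directory gets wired into the shard is assumed and not shown; the uploadMetadata argument matchers mirror the six-argument call quoted later in this review.

```java
// Hypothetical: verify the metadata upload is skipped on the second, no-op refresh.
RemoteSegmentStoreDirectory spyRemoteDirectory = spy(remoteSegmentStoreDirectory);

indexShard.refresh("initial-upload");
verify(spyRemoteDirectory, atLeastOnce()).uploadMetadata(any(), any(), any(), anyLong(), any(), anyString());

clearInvocations(spyRemoteDirectory);
indexShard.refresh("test-optimization");
verify(spyRemoteDirectory, never()).uploadMetadata(any(), any(), any(), anyLong(), any(), anyString());
```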
❌ Gradle check result for 7d7e4f1: FAILURE Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?
I tried to fix the test failures for this PR but I believe the changes may be incomplete. Snapshot state post this change is coming as PARTIAL instead of SUCCESS.
Signed-off-by: Hyunsang Han <[email protected]>
Signed-off-by: Hyunsang Han <[email protected]>
Signed-off-by: Hyunsang Han <[email protected]>
…WhenNewSegmentsPresent Signed-off-by: Hyunsang Han <[email protected]>
Signed-off-by: Hyunsang Han <[email protected]>
Signed-off-by: sjs004 <[email protected]>
Signed-off-by: sjs004 <[email protected]>
Actionable comments posted: 1
🧹 Nitpick comments (1)
server/src/main/java/org/opensearch/index/shard/RemoteStoreRefreshListener.java (1)
94-96: Consider initializing cache fields from remote metadata.

The cache fields are not initialized from remoteSegmentMetadata in the constructor (line 122), which could lead to a redundant metadata upload on the first refresh even when identical metadata already exists in the remote store. While this is not incorrect (the deduplication check will eventually work after the first upload), initializing these fields from remoteSegmentMetadata (similar to how primaryTerm is initialized) would optimize the initial state.

Apply this diff to initialize the cache from remote metadata:

```diff
 // initializing primary term with the primary term of latest metadata in remote store.
 // if no metadata is present, this value will be initialized with -1.
 this.primaryTerm = remoteSegmentMetadata != null ? remoteSegmentMetadata.getPrimaryTerm() : INVALID_PRIMARY_TERM;
+if (remoteSegmentMetadata != null) {
+    this.lastUploadedPrimaryTerm = remoteSegmentMetadata.getPrimaryTerm();
+    this.lastUploadedGeneration = remoteSegmentMetadata.getGeneration();
+    this.lastUploadedTranslogGeneration = remoteSegmentMetadata.getTranslogGeneration();
+}
 this.segmentTracker = segmentTracker;
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
server/src/main/java/org/opensearch/index/shard/RemoteStoreRefreshListener.java (3 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (20)
- GitHub Check: precommit (25, ubuntu-24.04-arm)
- GitHub Check: precommit (21, windows-2025, true)
- GitHub Check: precommit (25, macos-15-intel)
- GitHub Check: precommit (21, ubuntu-latest)
- GitHub Check: precommit (25, windows-latest)
- GitHub Check: precommit (25, macos-15)
- GitHub Check: precommit (21, macos-15)
- GitHub Check: precommit (21, macos-15-intel)
- GitHub Check: precommit (21, ubuntu-24.04-arm)
- GitHub Check: precommit (21, windows-latest)
- GitHub Check: precommit (25, ubuntu-latest)
- GitHub Check: assemble (21, ubuntu-24.04-arm)
- GitHub Check: assemble (25, ubuntu-24.04-arm)
- GitHub Check: assemble (21, ubuntu-latest)
- GitHub Check: assemble (25, windows-latest)
- GitHub Check: detect-breaking-change
- GitHub Check: Analyze (java)
- GitHub Check: assemble (25, ubuntu-latest)
- GitHub Check: assemble (21, windows-latest)
- GitHub Check: check-files
🔇 Additional comments (1)
server/src/main/java/org/opensearch/index/shard/RemoteStoreRefreshListener.java (1)
443-469: Deduplication logic is sound, but undermined by line 273.

The deduplication check here correctly compares primary term, SegmentInfos generation, and translog generation to determine whether metadata has changed. This logic properly handles:
- Primary term changes (will upload)
- Segment changes via generation (will upload)
- Translog advancement via translog generation (will upload)
The cache update after successful upload (lines 467-469) correctly tracks the last uploaded state.
However, this well-designed deduplication logic is currently being undermined by the check on line 273, which bypasses the call to uploadMetadata entirely when all segment files are unchanged, preventing this method from evaluating translog generation changes.

Note: Once the critical issue on line 273 is resolved, this deduplication logic will function correctly and provide the intended optimization.
Actionable comments posted: 1
🧹 Nitpick comments (2)
server/src/main/java/org/opensearch/index/shard/RemoteStoreRefreshListener.java (2)
94-96: Consider resetting cache fields on primary term change.

While the deduplication check at lines 445-447 includes lastUploadedPrimaryTerm, which correctly handles primary term transitions, explicitly resetting these cache fields in initializeRemoteDirectoryOnTermUpdate (around line 549) would make the invalidation logic more transparent and resilient to future modifications.

Apply this diff to reset the cache on primary term change:

```diff
 private void initializeRemoteDirectoryOnTermUpdate() throws IOException {
     if (this.primaryTerm != indexShard.getOperationPrimaryTerm()) {
         logger.trace("primaryTerm update from={} to={}", primaryTerm, indexShard.getOperationPrimaryTerm());
         this.primaryTerm = indexShard.getOperationPrimaryTerm();
+        // Reset cache on primary term change
+        this.lastUploadedPrimaryTerm = INVALID_PRIMARY_TERM;
+        this.lastUploadedGeneration = -1;
+        this.lastUploadedTranslogGeneration = -1;
         RemoteSegmentMetadata uploadedMetadata = this.remoteDirectory.init();
```
440-444: Minor: Translog generation retrieved before deduplication check.

Lines 440-444 retrieve and validate the translog generation before the deduplication check at lines 445-451. If deduplication skips the upload (returns early at line 450), this retrieval work is wasted.
For a minor optimization, consider moving the translog generation retrieval after the deduplication check, but this requires restructuring the cache to store whether translog generation has been retrieved. Given the complexity and minimal performance gain, this is a low-priority refinement.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
server/src/main/java/org/opensearch/index/shard/RemoteStoreRefreshListener.java (3 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (19)
- GitHub Check: gradle-check
- GitHub Check: precommit (25, macos-15)
- GitHub Check: precommit (25, windows-latest)
- GitHub Check: Analyze (java)
- GitHub Check: precommit (25, macos-15-intel)
- GitHub Check: precommit (25, ubuntu-24.04-arm)
- GitHub Check: precommit (21, windows-2025, true)
- GitHub Check: precommit (21, macos-15-intel)
- GitHub Check: precommit (21, windows-latest)
- GitHub Check: precommit (21, ubuntu-24.04-arm)
- GitHub Check: precommit (21, macos-15)
- GitHub Check: precommit (25, ubuntu-latest)
- GitHub Check: precommit (21, ubuntu-latest)
- GitHub Check: assemble (25, ubuntu-24.04-arm)
- GitHub Check: assemble (25, windows-latest)
- GitHub Check: assemble (21, ubuntu-24.04-arm)
- GitHub Check: assemble (25, ubuntu-latest)
- GitHub Check: assemble (21, windows-latest)
- GitHub Check: assemble (21, ubuntu-latest)
🔇 Additional comments (1)
server/src/main/java/org/opensearch/index/shard/RemoteStoreRefreshListener.java (1)
445-451: Critical: Verify snapshot compatibility with deduplication logic.

The review raises concerns that the deduplication check (lines 445-451) may break snapshot consistency, evidenced by a test failure showing the snapshot state becoming PARTIAL instead of SUCCESS. Before merging:
- Confirm the test failure: Verify that BlobStoreRepositoryRemoteIndexTests.testRetrieveShallowCopySnapshotCase2 reproduces the PARTIAL snapshot state with this deduplication logic.
- Validate state tracking completeness: Confirm that primary term, segment generation, and translog generation are sufficient to deduplicate metadata uploads, and that maxSeqNo cannot advance independently without triggering one of these values.
- Understand the snapshot protocol: Investigate whether snapshot operations depend on metadata uploads as synchronization points or coordination signals.
❌ Gradle check result for 9f5abc5: FAILURE Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?
Signed-off-by: sjs004 <[email protected]>
Actionable comments posted: 0
♻️ Duplicate comments (1)
server/src/main/java/org/opensearch/index/shard/RemoteStoreRefreshListener.java (1)
275-281: Delete the commented-out legacy metadata-upload branch

This entire if (localSegmentsPostRefresh.stream().allMatch(...)) block is now preserved only as comments and duplicates the logic that used to sit here. Version control already captures that history, so keeping it commented just adds noise around the core happy path (always calling uploadMetadata). Since a prior review already requested this cleanup, it’s better to delete it outright.

```diff
-        // if (localSegmentsPostRefresh.stream().allMatch(file -> skipUpload(file))) {
-        //     logger.debug("Skipping metadata upload - no new segments were uploaded");
-        // } else {
-        //     // Start metadata file upload
-        //     uploadMetadata(localSegmentsPostRefresh, segmentInfos, checkpoint);
-        //     logger.debug("Metadata upload successful");
-        // }
```
🧹 Nitpick comments (2)
server/src/main/java/org/opensearch/index/shard/RemoteStoreRefreshListener.java (2)
94-96: Clarify the sentinel initialization comment or align constants for all cached fields

The inline comment // Use constant or -1 on lastUploadedPrimaryTerm is ambiguous now that INVALID_PRIMARY_TERM is already in use, while the generation/translog fields still use a raw -1. Either remove the comment or make the intent explicit (e.g., introduce analogous constants for “no generation/no translog generation” and reference them here, as sketched below) so future readers aren’t left wondering if this is a TODO.
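One possible shape for those sentinels, sketched with illustrative names; the constants below are assumptions, not code from the PR:

```java
// Hypothetical sentinel constants so the "no prior upload" state is self-describing.
private static final long NO_UPLOADED_GENERATION = -1L;
private static final long NO_UPLOADED_TRANSLOG_GENERATION = -1L;

private volatile long lastUploadedPrimaryTerm = INVALID_PRIMARY_TERM;
private volatile long lastUploadedGeneration = NO_UPLOADED_GENERATION;
private volatile long lastUploadedTranslogGeneration = NO_UPLOADED_TRANSLOG_GENERATION;
```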
439-472: Validate deduplication key invariants and reconsider the exception type for a null translog generation

The new dedup guard in uploadMetadata looks like the right place to centralize “skip if nothing has changed” logic, and keying it off (primaryTerm, segmentInfos.getGeneration(), translogFileGeneration) is reasonable. A couple of points are worth double-checking:

- Dedup key completeness: Please confirm there is no scenario where the metadata content (notably the checkpoint/userData you derive via currentOngoingRefreshCheckpoint()) can change while these three identifiers remain the same. If such a scenario exists, this guard would incorrectly skip uploads and could explain situations like the PARTIAL snapshot state you’re seeing. If the invariant does hold (“metadata is entirely determined by primaryTerm + segment gen + translog generation”), consider adding a short comment documenting that assumption to prevent future regressions when uploadMetadata evolves.
- Null TranslogGeneration handling: Throwing UnsupportedOperationException for a null TranslogGeneration reads more like “this API is not supported” than “we hit an impossible runtime state.” If remote store is only expected to run with a non-null translog generation, IllegalStateException (or even an assertion) would better communicate that this is a broken invariant rather than an unsupported feature.

If you agree, updating the exception type and adding a brief comment on the dedup invariant (see the sketch below) would tighten this logic without changing behavior.
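A small sketch of what those two tweaks could look like together; the surrounding method and variable names follow the dedup guard discussed above and are assumptions, not the PR's exact code:

```java
// Assumption documented for future edits: the uploaded metadata is fully determined by
// (primaryTerm, segmentInfos generation, translog file generation); if uploadMetadata ever
// starts depending on more state, this skip must be revisited.
Translog.TranslogGeneration translogGeneration = indexShard.getEngine().translogManager().getTranslogGeneration();
if (translogGeneration == null) {
    // A broken runtime invariant for remote-store-backed shards, not an unsupported API.
    throw new IllegalStateException("Encountered null TranslogGeneration while uploading metadata to remote segment store");
}
```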
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
server/src/main/java/org/opensearch/index/shard/RemoteStoreRefreshListener.java (3 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (20)
- GitHub Check: detect-breaking-change
- GitHub Check: Analyze (java)
- GitHub Check: precommit (25, ubuntu-24.04-arm)
- GitHub Check: precommit (21, windows-2025, true)
- GitHub Check: precommit (25, macos-15)
- GitHub Check: precommit (25, macos-15-intel)
- GitHub Check: precommit (25, ubuntu-latest)
- GitHub Check: precommit (25, windows-latest)
- GitHub Check: precommit (21, windows-latest)
- GitHub Check: precommit (21, ubuntu-latest)
- GitHub Check: precommit (21, ubuntu-24.04-arm)
- GitHub Check: precommit (21, macos-15-intel)
- GitHub Check: precommit (21, macos-15)
- GitHub Check: assemble (21, ubuntu-24.04-arm)
- GitHub Check: assemble (25, ubuntu-24.04-arm)
- GitHub Check: assemble (25, windows-latest)
- GitHub Check: assemble (21, windows-latest)
- GitHub Check: assemble (25, ubuntu-latest)
- GitHub Check: assemble (21, ubuntu-latest)
- GitHub Check: check-files
❌ Gradle check result for d285df8: FAILURE Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?
Signed-off-by: sjs004 <[email protected]>
Actionable comments posted: 0
♻️ Duplicate comments (1)
server/src/main/java/org/opensearch/index/shard/RemoteStoreRefreshListener.java (1)
276-282: Remove the commented-out metadata-skip branch.

This entire block is dead code that duplicates the now-superseded skip logic and is already captured in git history. Keeping it commented out adds noise and risks confusion if future changes diverge from it. It’s safe to delete these lines outright.
🧹 Nitpick comments (1)
server/src/main/java/org/opensearch/index/shard/RemoteStoreRefreshListener.java (1)
94-97: Clarify sentinel usage and remove the inline “Use constant or -1” note.

The new lastUploaded* fields fit the dedup logic, but the mixed use of INVALID_PRIMARY_TERM and a raw -1, plus the inline “Use constant or -1” comment, is a bit confusing. Consider either:

- Introducing dedicated sentinel constants for generation/translog/maxSeqNo (similar to INVALID_PRIMARY_TERM), or
- Adding a short field-level comment explaining that -1 represents “no prior upload” and dropping the inline TODO-style note.

This keeps the invariants obvious and removes the magic numbers.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
server/src/main/java/org/opensearch/index/shard/RemoteStoreRefreshListener.java (3 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (10)
- GitHub Check: gradle-check
- GitHub Check: assemble (25, ubuntu-24.04-arm)
- GitHub Check: assemble (21, windows-latest)
- GitHub Check: assemble (25, windows-latest)
- GitHub Check: assemble (25, ubuntu-latest)
- GitHub Check: assemble (21, ubuntu-24.04-arm)
- GitHub Check: assemble (21, ubuntu-latest)
- GitHub Check: assemble (21, macos-15)
- GitHub Check: detect-breaking-change
- GitHub Check: Analyze (java)
🔇 Additional comments (1)
server/src/main/java/org/opensearch/index/shard/RemoteStoreRefreshListener.java (1)
440-475: Dedup guard and metadata upload flow look correct for the translog/seqNo case.

Using the tuple (primary term, Lucene generation, translog file generation, maxSeqNo) to decide whether to skip metadata aligns with the goal of still uploading when only the translog/maxSeqNo advance (no new segment files). The null check on getTranslogGeneration(), plus updating the lastUploaded* cache only after a successful uploadMetadata call, also looks sound for retries and avoids stale-cache issues.

I don’t see functional problems in this block; it should fix the earlier “skip when only translog advanced” bug while avoiding redundant uploads.
❌ Gradle check result for 02dee94: FAILURE Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?
Signed-off-by: sjs004 <[email protected]>
Actionable comments posted: 1
♻️ Duplicate comments (1)
server/src/main/java/org/opensearch/index/shard/RemoteStoreRefreshListener.java (1)
275-281: Remove commented-out code.

The commented-out lines should be deleted entirely. Version control maintains the history.
Apply this diff:
```diff
 try {
     logger.debug("New segments upload successful");
     uploadMetadata(localSegmentsPostRefresh, segmentInfos, checkpoint);
     logger.debug("Metadata upload successful");
-    // if (localSegmentsPostRefresh.stream().allMatch(file -> skipUpload(file))) {
-    //     logger.debug("Skipping metadata upload - no new segments were uploaded");
-    // } else {
-    //     // Start metadata file upload
-    //     uploadMetadata(localSegmentsPostRefresh, segmentInfos, checkpoint);
-    //     logger.debug("Metadata upload successful");
-    // }
     clearStaleFilesFromLocalSegmentChecksumMap(localSegmentsPostRefresh);
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
server/src/main/java/org/opensearch/index/shard/RemoteStoreRefreshListener.java (3 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: gradle-check
```diff
         final long maxSeqNo = ((InternalEngine) indexShard.getEngine()).currentOngoingRefreshCheckpoint();
+        Translog.TranslogGeneration translogGeneration = indexShard.getEngine().translogManager().getTranslogGeneration();
+        if (translogGeneration == null) {
+            throw new UnsupportedOperationException("Encountered null TranslogGeneration while uploading metadata to remote segment store");
+        }
+        long translogFileGeneration = translogGeneration.translogFileGeneration;
+        final long currentPrimaryTerm = replicationCheckpoint.getPrimaryTerm();
+        final long currentLuceneGeneration = segmentInfos.getGeneration();
+        final long currentCheckpointVersion = replicationCheckpoint.getSegmentInfosVersion();
+        if (this.lastUploadedPrimaryTerm == currentPrimaryTerm
+            && this.lastUploadedLuceneGeneration == currentLuceneGeneration
+            && this.lastUploadedCheckpointVersion == currentCheckpointVersion) {
+            // LOGIC: The index state (segment files, primary authority, and replication progress)
+            // is identical to the last state successfully uploaded. Skip the costly remote I/O.
+            return;
+        }
         SegmentInfos segmentInfosSnapshot = segmentInfos.clone();
         Map<String, String> userData = segmentInfosSnapshot.getUserData();
         userData.put(LOCAL_CHECKPOINT_KEY, String.valueOf(maxSeqNo));
         userData.put(SequenceNumbers.MAX_SEQ_NO, Long.toString(maxSeqNo));
         segmentInfosSnapshot.setUserData(userData, false);
-        Translog.TranslogGeneration translogGeneration = indexShard.getEngine().translogManager().getTranslogGeneration();
-        if (translogGeneration == null) {
-            throw new UnsupportedOperationException("Encountered null TranslogGeneration while uploading metadata to remote segment store");
-        } else {
-            long translogFileGeneration = translogGeneration.translogFileGeneration;
-            remoteDirectory.uploadMetadata(
-                localSegmentsPostRefresh,
-                segmentInfosSnapshot,
-                storeDirectory,
-                translogFileGeneration,
-                replicationCheckpoint,
-                indexShard.getNodeId()
-            );
-        }
+        remoteDirectory.uploadMetadata(
+            localSegmentsPostRefresh,
+            segmentInfosSnapshot,
+            storeDirectory,
+            translogFileGeneration,
+            replicationCheckpoint,
+            indexShard.getNodeId()
+        );
+        this.lastUploadedPrimaryTerm = currentPrimaryTerm;
+        this.lastUploadedLuceneGeneration = currentLuceneGeneration;
+        this.lastUploadedCheckpointVersion = currentCheckpointVersion;
     }
```
Critical: Deduplication check must include translog generation.
The deduplication logic (lines 448-450) checks primaryTerm, luceneGeneration, and checkpointVersion, but omits translogFileGeneration. When the translog advances without creating new segments (e.g., documents indexed but not yet flushed), the three checked values remain unchanged, causing the method to return early (line 454) and skip the metadata upload. This leaves the remote store with stale translog generation information.
This likely explains the test failure in the PR description: the snapshot expects up-to-date metadata reflecting the current translog state but instead finds stale information, resulting in a PARTIAL snapshot state instead of SUCCESS.
To fix this, add lastUploadedTranslogGeneration tracking and include it in the deduplication check:
```diff
 private volatile long lastUploadedPrimaryTerm = -1L;
 private volatile long lastUploadedLuceneGeneration = -1L;
 private volatile long lastUploadedCheckpointVersion = -1L;
+private volatile long lastUploadedTranslogGeneration = -1L;
```

Then update the deduplication check:

```diff
 final long currentPrimaryTerm = replicationCheckpoint.getPrimaryTerm();
 final long currentLuceneGeneration = segmentInfos.getGeneration();
 final long currentCheckpointVersion = replicationCheckpoint.getSegmentInfosVersion();
 if (this.lastUploadedPrimaryTerm == currentPrimaryTerm
     && this.lastUploadedLuceneGeneration == currentLuceneGeneration
-    && this.lastUploadedCheckpointVersion == currentCheckpointVersion) {
+    && this.lastUploadedCheckpointVersion == currentCheckpointVersion
+    && this.lastUploadedTranslogGeneration == translogFileGeneration) {
     // LOGIC: The index state (segment files, primary authority, and replication progress)
     // is identical to the last state successfully uploaded. Skip the costly remote I/O.
     return;
 }
```

And update the cache after successful upload:

```diff
 this.lastUploadedPrimaryTerm = currentPrimaryTerm;
 this.lastUploadedLuceneGeneration = currentLuceneGeneration;
 this.lastUploadedCheckpointVersion = currentCheckpointVersion;
+this.lastUploadedTranslogGeneration = translogFileGeneration;
```

Committable suggestion skipped: line range outside the PR's diff.
🤖 Prompt for AI Agents
In
server/src/main/java/org/opensearch/index/shard/RemoteStoreRefreshListener.java
around lines 438 to 475, the deduplication check omits the translog file
generation so uploads are skipped when translog advances without new segments;
add a long field lastUploadedTranslogGeneration to the class, include a
comparison of lastUploadedTranslogGeneration == translogFileGeneration in the
if-condition alongside primaryTerm, luceneGeneration and checkpointVersion, and
after a successful remoteDirectory.uploadMetadata(...) set
this.lastUploadedTranslogGeneration = translogFileGeneration (ensure the field
is initialized to a sentinel like -1L so first upload proceeds).
Signed-off-by: sjs004 <[email protected]>
❌ Gradle check result for 98319a3: FAILURE Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?
Signed-off-by: sjs004 <[email protected]>
❌ Gradle check result for 44dab6a: FAILURE Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?
Signed-off-by: sjs004 <[email protected]>
❌ Gradle check result for e2fec2b: FAILURE Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?
Signed-off-by: sjs004 <[email protected]>
❕ Gradle check result for ba4c4bf: UNSTABLE Please review all flaky tests that succeeded after retry and create an issue if one does not already exist to track the flaky failure.
Codecov Report

❌ Patch coverage is

Additional details and impacted files

```
@@             Coverage Diff              @@
##               main   #20192      +/-   ##
============================================
- Coverage     73.30%   73.20%   -0.11%
+ Complexity    71804    71727      -77
============================================
  Files          5793     5793
  Lines        328160   328184      +24
  Branches      47266    47268       +2
============================================
- Hits         240559   240231     -328
- Misses        68295    68689     +394
+ Partials      19306    19264      -42
```

☔ View full report in Codecov by Sentry.
Description
Optimizes remote store refresh performance by avoiding unnecessary metadata uploads when index state is unchanged
Related Issues
Resolves #17530
Check List
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.