Fix closed-index reroute handling - Shard routings for closed index are allocated again without opening the index#20558
Conversation
|
Important Review skippedAuto incremental reviews are disabled on this repository. Please check the settings in the CodeRabbit UI or the You can disable this status message by setting the Use the checkbox below for a quick retry:
📝 WalkthroughWalkthroughThis pull request fixes a bug where shard routings for closed indices were being allocated during reroute operations. A private helper method is added to Changes
Estimated code review effort🎯 2 (Simple) | ⏱️ ~10 minutes 🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing touches🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
Actionable comments posted: 0
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
server/src/main/java/org/opensearch/cluster/routing/allocation/AllocationService.java (1)
635-645:⚠️ Potential issue | 🟠 MajorBatch mode bypasses closed-index shard filtering
When batch mode is enabled (
EXISTING_SHARDS_ALLOCATOR_BATCH_MODEsetting and all nodes on 2.14.0+), the allocation service returns early before applying theisShardFromClosedIndex()check. The batch allocator'screateAndUpdateBatches()method iterates through all unassigned shards without filtering closed indices, so these shards may be allocated viaShardsBatchGatewayAllocator. Add an equivalent closed-index guard in the batch path and test it in batch-mode scenarios.
📜 Review details
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (3)
CHANGELOG.mdserver/src/main/java/org/opensearch/cluster/routing/allocation/AllocationService.javaserver/src/test/java/org/opensearch/cluster/routing/allocation/AllocationServiceTests.java
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2026-01-13T17:40:27.167Z
Learnt from: reta
Repo: opensearch-project/OpenSearch PR: 20411
File: server/src/main/java/org/opensearch/index/codec/CodecService.java:112-133
Timestamp: 2026-01-13T17:40:27.167Z
Learning: Avoid capturing or evaluating a supplier (e.g., this::defaultCodec) upfront when passing it to a registry during object construction. If registries may replace defaults during iteration (as in EnginePlugin.getAdditionalCodecs), pass the supplier itself and only resolve it at use time. This ensures dynamic behavior is preserved during initialization and prevents premature binding of defaults in codecs/registry setup. This pattern should apply to similar initialization paths in Java server code where registries may mutate defaults during construction.
Applied to files:
server/src/main/java/org/opensearch/cluster/routing/allocation/AllocationService.java
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (20)
- GitHub Check: gradle-check
- GitHub Check: assemble (21, ubuntu-latest)
- GitHub Check: assemble (21, ubuntu-24.04-arm)
- GitHub Check: assemble (25, windows-latest)
- GitHub Check: assemble (21, windows-latest)
- GitHub Check: precommit (21, macos-15-intel)
- GitHub Check: assemble (25, ubuntu-latest)
- GitHub Check: assemble (25, ubuntu-24.04-arm)
- GitHub Check: precommit (25, macos-15-intel)
- GitHub Check: Analyze (java)
- GitHub Check: precommit (25, windows-latest)
- GitHub Check: precommit (21, ubuntu-24.04-arm)
- GitHub Check: precommit (25, ubuntu-latest)
- GitHub Check: precommit (25, ubuntu-24.04-arm)
- GitHub Check: precommit (21, windows-2025, true)
- GitHub Check: precommit (21, ubuntu-latest)
- GitHub Check: precommit (21, windows-latest)
- GitHub Check: precommit (25, macos-15)
- GitHub Check: precommit (21, macos-15)
- GitHub Check: detect-breaking-change
🔇 Additional comments (3)
server/src/main/java/org/opensearch/cluster/routing/allocation/AllocationService.java (1)
610-613: Clear closed-index helperNice encapsulation; it keeps the allocation loop readable and focused.
server/src/test/java/org/opensearch/cluster/routing/allocation/AllocationServiceTests.java (1)
574-682: Solid regression coverage for closed-index rerouteThe test validates unassigned state and reason, plus index state preservation—good coverage for the fix.
CHANGELOG.md (1)
15-15: Changelog entry aligns with the fixClear and correctly categorized under “Fixed.”
✏️ Tip: You can disable this entire section by setting review_details to false in your review settings.
|
❌ Gradle check result for 5d30b5a: FAILURE Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change? |
|
@andrross My changes only affect shard allocation for closed indices during reroute operations. The failing tests don't involve closed indices or the allocation code paths I modified. These appear to be pre-existing flaky tests. |
|
❌ Gradle check result for 8d41b62: FAILURE Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change? |
|
❌ Gradle check result for 09313fd: FAILURE Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change? |
|
❌ Gradle check result for 09313fd: FAILURE Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change? |
|
@srikanthpadakanti The past two test runs have failed with the same error: I've not seen that specific failure before. Are you sure it's not related to your changes? |
2f6ff83 to
48c5d94
Compare
|
❌ Gradle check result for 48c5d94: null Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change? |
|
❌ Gradle check result for ee3cbfd: FAILURE Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change? |
Signed-off-by: Srikanth Padakanti <srikanth_padakanti@apple.com>
Signed-off-by: Srikanth Padakanti <srikanth_padakanti@apple.com>
Signed-off-by: Srikanth Padakanti <srikanth_padakanti@apple.com>
Signed-off-by: Srikanth Padakanti <srikanth_padakanti@apple.com>
ee3cbfd to
58ea3c0
Compare
Signed-off-by: Srikanth Padakanti <srikanth_padakanti@apple.com>
|
❌ Gradle check result for 62b771a: FAILURE Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change? |
Signed-off-by: Srikanth Padakanti <srikanth29.9@gmail.com>
|
❌ Gradle check result for 725024e: FAILURE Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change? |
|
@andrross When an index is explicitly closed (not during cluster recovery), should the shards: Remain on nodes in a read-only/closed state (allows translog trimming) OR become unassigned and stay unassigned ? My fix failing some existing tests like - IndexServiceTests.testAsyncTranslogTrimTaskOnClosedIndex |
Description
Prevents allocation of shards from closed indices during cluster reroute operations. Added a check in
AllocationService.allocateExistingUnassignedShards()to skip shards belonging to closed indices, ensuring they remain unassigned until the index is explicitly reopened.Related Issues
Resolves #20200
Check List
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.