Skip to content

Fix closed-index reroute handling - Shard routings for closed index are allocated again without opening the index#20558

Open
srikanthpadakanti wants to merge 6 commits intoopensearch-project:mainfrom
srikanthpadakanti:20200-closed-index-reroute-fix
Open

Fix closed-index reroute handling - Shard routings for closed index are allocated again without opening the index#20558
srikanthpadakanti wants to merge 6 commits intoopensearch-project:mainfrom
srikanthpadakanti:20200-closed-index-reroute-fix

Conversation

@srikanthpadakanti
Copy link
Contributor

Description

Prevents allocation of shards from closed indices during cluster reroute operations. Added a check in AllocationService.allocateExistingUnassignedShards() to skip shards belonging to closed indices, ensuring they remain unassigned until the index is explicitly reopened.

Related Issues

Resolves #20200

Check List

  • [ X ] Functionality includes testing.
  • API changes companion pull request created, if applicable.
  • Public documentation issue/PR created, if applicable.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

@coderabbitai
Copy link
Contributor

coderabbitai bot commented Feb 5, 2026

Important

Review skipped

Auto incremental reviews are disabled on this repository.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Use the checkbox below for a quick retry:

  • 🔍 Trigger review
📝 Walkthrough

Walkthrough

This pull request fixes a bug where shard routings for closed indices were being allocated during reroute operations. A private helper method is added to AllocationService to identify shards from closed indices, which is then used to skip allocation in both primary and replica unassigned shard processing paths. A test validates that closed index shards remain unallocated after reroute.

Changes

Cohort / File(s) Summary
Closed Index Shard Allocation Fix
server/src/main/java/org/opensearch/cluster/routing/allocation/AllocationService.java, server/src/test/java/org/opensearch/cluster/routing/allocation/AllocationServiceTests.java
Added private helper method isShardFromClosedIndex() to check if a shard's index is in CLOSED state. Used this helper to skip allocation for closed-index shards in primary and replica unassigned shard allocation paths. New test testClosedIndexShardsAreNotAllocatedDuringReroute() validates that shards from closed indices remain unallocated with INDEX\_CLOSED reason after reroute operations.
Changelog
CHANGELOG.md
Added entry documenting that shard routings for closed indices are no longer allocated without opening the index.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

🚥 Pre-merge checks | ✅ 4 | ❌ 1
❌ Failed checks (1 warning)
Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed The title accurately describes the main fix: preventing shard allocation for closed indices during reroute operations.
Description check ✅ Passed The PR description covers the key change, links the related issue (#20200), and confirms testing was included per the checklist.
Linked Issues check ✅ Passed The changes directly address issue #20200 by implementing the allocation check to skip shards from closed indices during reroute.
Out of Scope Changes check ✅ Passed All changes are focused and directly related to fixing closed-index reroute handling; no unrelated modifications detected.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
server/src/main/java/org/opensearch/cluster/routing/allocation/AllocationService.java (1)

635-645: ⚠️ Potential issue | 🟠 Major

Batch mode bypasses closed-index shard filtering

When batch mode is enabled (EXISTING_SHARDS_ALLOCATOR_BATCH_MODE setting and all nodes on 2.14.0+), the allocation service returns early before applying the isShardFromClosedIndex() check. The batch allocator's createAndUpdateBatches() method iterates through all unassigned shards without filtering closed indices, so these shards may be allocated via ShardsBatchGatewayAllocator. Add an equivalent closed-index guard in the batch path and test it in batch-mode scenarios.

📜 Review details

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 3ba2f37 and 1b5820c.

📒 Files selected for processing (3)
  • CHANGELOG.md
  • server/src/main/java/org/opensearch/cluster/routing/allocation/AllocationService.java
  • server/src/test/java/org/opensearch/cluster/routing/allocation/AllocationServiceTests.java
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2026-01-13T17:40:27.167Z
Learnt from: reta
Repo: opensearch-project/OpenSearch PR: 20411
File: server/src/main/java/org/opensearch/index/codec/CodecService.java:112-133
Timestamp: 2026-01-13T17:40:27.167Z
Learning: Avoid capturing or evaluating a supplier (e.g., this::defaultCodec) upfront when passing it to a registry during object construction. If registries may replace defaults during iteration (as in EnginePlugin.getAdditionalCodecs), pass the supplier itself and only resolve it at use time. This ensures dynamic behavior is preserved during initialization and prevents premature binding of defaults in codecs/registry setup. This pattern should apply to similar initialization paths in Java server code where registries may mutate defaults during construction.

Applied to files:

  • server/src/main/java/org/opensearch/cluster/routing/allocation/AllocationService.java
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (20)
  • GitHub Check: gradle-check
  • GitHub Check: assemble (21, ubuntu-latest)
  • GitHub Check: assemble (21, ubuntu-24.04-arm)
  • GitHub Check: assemble (25, windows-latest)
  • GitHub Check: assemble (21, windows-latest)
  • GitHub Check: precommit (21, macos-15-intel)
  • GitHub Check: assemble (25, ubuntu-latest)
  • GitHub Check: assemble (25, ubuntu-24.04-arm)
  • GitHub Check: precommit (25, macos-15-intel)
  • GitHub Check: Analyze (java)
  • GitHub Check: precommit (25, windows-latest)
  • GitHub Check: precommit (21, ubuntu-24.04-arm)
  • GitHub Check: precommit (25, ubuntu-latest)
  • GitHub Check: precommit (25, ubuntu-24.04-arm)
  • GitHub Check: precommit (21, windows-2025, true)
  • GitHub Check: precommit (21, ubuntu-latest)
  • GitHub Check: precommit (21, windows-latest)
  • GitHub Check: precommit (25, macos-15)
  • GitHub Check: precommit (21, macos-15)
  • GitHub Check: detect-breaking-change
🔇 Additional comments (3)
server/src/main/java/org/opensearch/cluster/routing/allocation/AllocationService.java (1)

610-613: Clear closed-index helper

Nice encapsulation; it keeps the allocation loop readable and focused.

server/src/test/java/org/opensearch/cluster/routing/allocation/AllocationServiceTests.java (1)

574-682: Solid regression coverage for closed-index reroute

The test validates unassigned state and reason, plus index state preservation—good coverage for the fix.

CHANGELOG.md (1)

15-15: Changelog entry aligns with the fix

Clear and correctly categorized under “Fixed.”

✏️ Tip: You can disable this entire section by setting review_details to false in your review settings.

@github-actions
Copy link
Contributor

github-actions bot commented Feb 5, 2026

❌ Gradle check result for 5d30b5a: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

@srikanthpadakanti
Copy link
Contributor Author

@andrross My changes only affect shard allocation for closed indices during reroute operations. The failing tests don't involve closed indices or the allocation code paths I modified. These appear to be pre-existing flaky tests.

@github-actions
Copy link
Contributor

github-actions bot commented Feb 7, 2026

❌ Gradle check result for 8d41b62: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

@github-actions
Copy link
Contributor

github-actions bot commented Feb 9, 2026

❌ Gradle check result for 09313fd: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

@github-actions
Copy link
Contributor

❌ Gradle check result for 09313fd: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

@andrross
Copy link
Member

@srikanthpadakanti The past two test runs have failed with the same error:

» ERROR][o.o.b.OpenSearchUncaughtExceptionHandler] [v2.19.5-remote-0] fatal error in thread [opensearch[v2.19.5-remote-0][clusterApplierService#updateTask][T#1]], exiting
»  java.lang.AssertionError: a started primary with non-pending operation term must be in primary mode [turn_off_translog_retention][0], node[iG6UKdabScSW_TM7801bUw], [P], s[STARTED], a[id=MQpv3lH-Q3mgnk9I9mXujQ]

I've not seen that specific failure before. Are you sure it's not related to your changes?

@srikanthpadakanti srikanthpadakanti force-pushed the 20200-closed-index-reroute-fix branch from 2f6ff83 to 48c5d94 Compare February 12, 2026 15:54
@github-actions
Copy link
Contributor

❌ Gradle check result for 48c5d94: null

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

@github-actions
Copy link
Contributor

❌ Gradle check result for ee3cbfd: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

Srikanth Padakanti added 4 commits February 12, 2026 16:30
Signed-off-by: Srikanth Padakanti <srikanth_padakanti@apple.com>
Signed-off-by: Srikanth Padakanti <srikanth_padakanti@apple.com>
Signed-off-by: Srikanth Padakanti <srikanth_padakanti@apple.com>
Signed-off-by: Srikanth Padakanti <srikanth_padakanti@apple.com>
@srikanthpadakanti srikanthpadakanti force-pushed the 20200-closed-index-reroute-fix branch from ee3cbfd to 58ea3c0 Compare February 12, 2026 22:30
Signed-off-by: Srikanth Padakanti <srikanth_padakanti@apple.com>
@github-actions
Copy link
Contributor

❌ Gradle check result for 62b771a: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

Signed-off-by: Srikanth Padakanti <srikanth29.9@gmail.com>
@github-actions
Copy link
Contributor

❌ Gradle check result for 725024e: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

@srikanthpadakanti
Copy link
Contributor Author

@andrross When an index is explicitly closed (not during cluster recovery), should the shards: Remain on nodes in a read-only/closed state (allows translog trimming) OR become unassigned and stay unassigned ?

My fix failing some existing tests like - IndexServiceTests.testAsyncTranslogTrimTaskOnClosedIndex

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug Something isn't working Cluster Manager

Projects

Status: No status

Development

Successfully merging this pull request may close these issues.

[BUG] Shard routings for closed index are allocated again without opening the index

2 participants