improve stoppable check APIs to include failure details #63
Conversation
LZD-PratyushBhatt left a comment:
Good work! Overall LGTM as per our discussion.
I've added a few comments. Also, please add more test coverage around format validation and the other cases raised in the review comments.
| "Partition {} doesn't have enough active replicas in sibling nodes. NumHealthySiblings: {}, minActiveReplicas: {}", | ||
| partition, numHealthySiblings, minActiveReplicas); | ||
| return false; | ||
| _logger.warn( |
I would suggest moving this to a debug log instead; I don't think it qualifies as WARN.
Also debug because we no longer stop at just the first failure: we keep going even after failures, so this becomes a hot path now. At the end we can post a summary.
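A minimal sketch of the pattern being suggested, assuming an slf4j logger; the names and the plain-map input are hypothetical, since the real Helix check runs against cluster state:

```java
import java.util.Map;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

// Illustrative sketch only: per-partition misses go to DEBUG on the hot path,
// and a single WARN summary is posted after the loop.
class ReplicaCheckLoggingSketch {
  private static final Logger LOG = LoggerFactory.getLogger(ReplicaCheckLoggingSketch.class);

  static boolean allPartitionsHealthy(Map<String, Integer> healthySiblingsByPartition,
      int minActiveReplicas) {
    int failed = 0;
    for (Map.Entry<String, Integer> entry : healthySiblingsByPartition.entrySet()) {
      if (entry.getValue() < minActiveReplicas) {
        // DEBUG keeps the per-partition detail without flooding WARN-level logs.
        LOG.debug("Partition {} doesn't have enough active replicas in sibling nodes. "
            + "NumHealthySiblings: {}, minActiveReplicas: {}",
            entry.getKey(), entry.getValue(), minActiveReplicas);
        failed++;
      }
    }
    if (failed > 0) {
      // One WARN summary at the end instead of one WARN per partition.
      LOG.warn("{} partitions failed the min active replica check", failed);
    }
    return failed == 0;
  }
}
```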
> we keep going even after the failures, so this becomes a hot path now.

Nope. We are still failing fast and provide only the details of the first partition that violates the min active replica check. And I chose WARN because this may require attention and is helpful while debugging.
Wait, I just noticed: why are we not collecting all the failed partitions?
```java
 * shouldn't contain the `instanceName`
 * @return MinActiveReplicaCheckResult with pass/fail status and details of first failure
 */
public static MinActiveReplicaCheckResult siblingNodesActiveReplicaCheckWithDetails(
```
The current implementation doesn't have 1:1 backward compatibility: even if includeDetails is not passed, we still enter this method and check all the partitions nonetheless.
I propose adding a flag like failOnFirst to this method, set to true when includeDetails wasn't passed or is false, that bails out on the first failure.
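A minimal sketch of the failOnFirst idea as proposed; the method shape and inputs are hypothetical and do not match the real Helix signature:

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch of the proposed failOnFirst flag.
// failOnFirst=true reproduces the legacy fail-fast behavior.
class SiblingCheckSketch {
  static List<String> findViolations(List<String> partitions, boolean failOnFirst) {
    List<String> violations = new ArrayList<>();
    for (String partition : partitions) {
      if (!hasEnoughActiveSiblings(partition)) {
        violations.add(partition);
        if (failOnFirst) {
          break; // legacy behavior: stop at the first violating partition
        }
      }
    }
    return violations;
  }

  // Placeholder for the real sibling-replica check against cluster state.
  private static boolean hasEnoughActiveSiblings(String partition) {
    return true;
  }
}
```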
This implementation is fully backward compatible with the existing one. We still fail fast and provide details only for the first partition violating the min active replica check.
Just noticed. Why are we not collecting all the failed partitions?
The idea was to give an indication of the failed partitions, since the result will be truncated anyway.
The result will be truncated in the Nimbus UI/customized status. But if we have this, we can add logging on the ACM side for earlier debugging.
I agree about the many-partitions part. Here's what I think can be done smartly:
if the error list has fewer than 10 entries, list them all; for everything beyond that, just give a count summary (see the sketch below).
Maintaining backward compatibility is simple, as I mentioned in the original comment, with the use of failOnFirst.
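A minimal sketch of the "list up to 10, summarize the rest" idea; the threshold comes from the comment, and the class and message shape are hypothetical:

```java
import java.util.List;

// Illustrative sketch only: cap the per-partition detail and append a count summary.
class FailureSummarySketch {
  static String summarize(List<String> failureMessages) {
    final int maxListed = 10; // threshold suggested in the review comment
    if (failureMessages.size() <= maxListed) {
      return String.join("; ", failureMessages);
    }
    return String.join("; ", failureMessages.subList(0, maxListed))
        + String.format(" ... and %d more partitions failed the check",
            failureMessages.size() - maxListed);
  }
}
```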
IMHO, as already mentioned in response to another comment above, we need to iterate on this feature.
I'm not very convinced about adding more partition details in string format;
I want to see customer feedback on the current version and iterate based on that.
We need better request/response modelling to handle all the other check failures too, in addition to the MIN_ACTIVE_REPLICA check.
I have some thoughts on it, but want to iterate based on the feedback.
Adding more flags to the current API makes them irrelevant for the other flavours of the APIs.
A few observations:
- Logging it on the ACM is also not a good idea if that information can be readily fetched.
- As @laxman-ch mentioned, we can iterate based on customer feedback; also, the current ask is for the Nimbus UI, so this should be good.
- We should anyway have an API that gives information on all partitions, if that is something we (as the Helix team) or the customers need.
- I like the idea of mentioning a count summary. @laxman-ch, do you think we can take this up in this PR, or can it be a follow-up?
> I like the idea of mentioning a count summary, @laxman-ch do you think we can take this up in this PR or can be a followup?

I thought about it during the implementation and discussed it with @LZD-PratyushBhatt at the time.
A count alone may not give users anything actionable without the additional debugging/API calls they are already doing today:
"500 partitions failed the replica check"
Another reason for not including it: I don't want to pack too many details into a string message. The right approach is for all these flags/details to be part of the request/response model.
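A minimal sketch of the kind of structured response model being argued for here, instead of string-packed details; this is not an actual Helix class, and all field names are hypothetical:

```java
import java.util.List;

// Illustrative sketch: structured, machine-readable check results instead of
// encoding resource/partition/replica counts into a single message string.
class StoppableCheckResponseSketch {
  boolean stoppable;
  List<String> failedChecks; // e.g. "HELIX:MIN_ACTIVE_REPLICA_CHECK_FAILED"
  List<PartitionFailure> partitionFailures; // per-partition diagnostic details

  static class PartitionFailure {
    String resource;
    String partition;
    int activeReplicas;
    int requiredMinActiveReplicas;
  }
}
```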
By count, I meant the count of partitions in an initial state (like OFFLINE) and in ERROR state.
I was against printing only the first partition's info because if you have 10 partitions in ERROR state and you fix one, that doesn't affect the other 9; rebalancing doesn't have any effect on ERROR partitions.
A count summary at least lets the user know they need to look at the current state of the participant to get the list of all initial/error-state partitions.
Anyway, my comment/suggestion is non-binding and I won't block the PR because of it.
Overall LGTM
```java
/**
 * Result class for min active replica check with details about the first failure
 */
public static class MinActiveReplicaCheckResult {
```
nit: Do we want to move this to a separate class?
Done. For now, moved it to a separate internal implementation class.
However, we need to iterate on this to provide detailed responses for all failure checks;
for that, we need a dedicated request/response object model as part of the REST APIs.
The current model of returning strings as the response is very error-prone.
```java
String detailedMessage = String.format("%s: Resource %s partition %s has %d/%d active replicas",
    HealthCheck.MIN_ACTIVE_REPLICA_CHECK_FAILED.name(),
    result.getResourceName(), result.getPartitionName(),
    result.getCurrentActiveReplicas(), result.getRequiredMinActiveReplicas());
```
Can we create a toString method in MinActiveReplicaCheckResult to avoid this?
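A minimal sketch of what that could look like, using the field names implied by the getters in the diff above; the class shape is hypothetical, and the real code would presumably use HealthCheck.MIN_ACTIVE_REPLICA_CHECK_FAILED.name() rather than a literal prefix:

```java
// Illustrative sketch: the result object owns its formatting, so call sites
// don't rebuild the message string themselves.
class MinActiveReplicaCheckResultSketch {
  private final String resourceName;
  private final String partitionName;
  private final int currentActiveReplicas;
  private final int requiredMinActiveReplicas;

  MinActiveReplicaCheckResultSketch(String resourceName, String partitionName,
      int currentActiveReplicas, int requiredMinActiveReplicas) {
    this.resourceName = resourceName;
    this.partitionName = partitionName;
    this.currentActiveReplicas = currentActiveReplicas;
    this.requiredMinActiveReplicas = requiredMinActiveReplicas;
  }

  @Override
  public String toString() {
    // Same message shape as the String.format call quoted in the diff above.
    return String.format("MIN_ACTIVE_REPLICA_CHECK_FAILED: Resource %s partition %s has %d/%d active replicas",
        resourceName, partitionName, currentActiveReplicas, requiredMinActiveReplicas);
  }
}
```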
```java
@@ -379,11 +396,16 @@ public Map<String, StoppableCheck> batchGetInstancesStoppableChecks(String clust

  public Map<String, StoppableCheck> batchGetInstancesStoppableChecks(String clusterId,
```
Do we want to overload both getInstanceStoppableCheck and batchGetInstancesStoppableChecks?
IIUC, the comment is about having fewer methods.
I agree with that in general; thought about it in self-review too.
In the current context, overloaded methods look like the better approach (see the sketch below).
Once we introduce a proper request/response model, a builder pattern looks like the right way.
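A minimal sketch of the overload-now approach described here; the names, return type, and parameters are simplified and do not match the real Helix service exactly:

```java
import java.util.Collections;
import java.util.List;
import java.util.Map;

// Illustrative sketch: the existing signature delegates to a new overload,
// so current callers keep the old behavior untouched.
class StoppableCheckServiceSketch {
  Map<String, Boolean> batchGetStoppableChecks(String clusterId, List<String> instances) {
    return batchGetStoppableChecks(clusterId, instances, false);
  }

  // New overload carries the opt-in flag; the default path pins it to false.
  Map<String, Boolean> batchGetStoppableChecks(String clusterId, List<String> instances,
      boolean includeDetails) {
    // The real implementation would run the health checks and, when
    // includeDetails is true, attach the first-failure diagnostic message.
    return Collections.emptyMap();
  }
}
```

The trade-off noted above: overloads keep the change additive today, while a builder over a proper request model would scale better once more flags accumulate.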
```java
String detailedMessage = String.format("%s: Resource %s partition %s has %d/%d active replicas",
    HealthCheck.MIN_ACTIVE_REPLICA_CHECK_FAILED.name(),
    result.getResourceName(), result.getPartitionName(),
    result.getCurrentActiveReplicas(), result.getRequiredMinActiveReplicas());
```
Also, I would suggest shortening this.
`HELIX:MIN_ACTIVE_REPLICA_CHECK_FAILED: Resource MyDatabase partition MyDatabase_shard_5 has 1/3 active replicas` is too lengthy.
`HELIX:MIN_ACTIVE_REPLICA_CHECK_FAILED: MyDatabase_MyDatabase_shard_5_1/3_replicas`
This should work too.
This is a human-readable string, like a log message or status message, and I don't want to introduce a fragile (underscore-delimited) contract into it. Also, in general, resource names and partition names can be free-form. The current string looks good to me: it makes no assumptions about these names and doesn't introduce an unnecessary contract.
Issues
Feature Request: Enhance stoppable instance checks with partition-level diagnostic information
This enhancement improves the stoppable instance API by providing granular diagnostic details when replica count validations fail, enabling better operational visibility.
Description
Why: Operations teams previously received only high-level error codes when instance shutdown was blocked due to replica constraints. This lack of specificity made it difficult to identify which resources or partitions were preventing safe instance removal, leading to longer troubleshooting cycles.
How:

Sample Enhanced Output:

```
HELIX:MIN_ACTIVE_REPLICA_CHECK_FAILED: Resource MyDatabase partition MyDatabase_shard_5 has 1/3 active replicas
```

Tests
Core Validation Tests:
- testSiblingNodesActiveReplicaCheckWithDetailsSuccess() - Validates successful replica count verification with diagnostic mode
- testSiblingNodesActiveReplicaCheckWithDetailsFailure() - Confirms detailed error capture when replica thresholds are not satisfied

REST Endpoint Tests:
- testInstancesStoppablePostWithIncludeDetails() - Verifies batch endpoint diagnostic functionality
- testIsInstanceStoppableWithIncludeDetailsDefault() - Confirms unchanged default behavior
- testIsInstanceStoppableWithIncludeDetailsFalse() - Validates explicit diagnostic mode disabled
- testIsInstanceStoppableWithIncludeDetailsTrue() - Tests diagnostic mode activation
- testPerInstanceStoppableWithIncludeDetailsForMinActiveReplica() - Cross-validates diagnostic vs standard responses

Test Results:
All existing and new test suites pass without failures. Backward compatibility verified through regression testing.
```
[INFO] Tests run: 26, Failures: 0, Errors: 0, Skipped: 0
[INFO] Tests run: 4, Failures: 0, Errors: 0, Skipped: 0
[INFO] BUILD SUCCESS
```

Changes that Break Backward Compatibility (Optional)
Zero breaking changes introduced; this enhancement is fully additive.
Documentation (Optional)
(Link the GitHub wiki you added)
Commits
Code Quality
(helix-style-intellij.xml if IntelliJ IDE is used)