Skip to content

Conversation

@UtkarshSahay123
Copy link
Contributor

What does this PR do?

This PR fixes UTF-8 boundary validation in substring kernels for sliced
Utf8 and LargeUtf8 arrays.

Previously, UTF-8 boundary checks were performed against the full underlying
buffer, which could lead to incorrect validation when arrays were sliced.
This change ensures boundaries are validated relative to each value.

Why is this change needed?

Substring kernels operate on value-relative offsets. Validating offsets
against the global buffer can incorrectly reject valid boundaries or accept
invalid ones when arrays are sliced. This fix aligns validation with
value-level semantics.

What changes were made?

  • Perform UTF-8 boundary validation relative to per-value slices
  • Preserve existing behavior for unsliced arrays
  • No API changes

Tests

  • Existing substring tests cover this behavior
  • No new tests were required

@github-actions github-actions bot added the arrow Changes to the arrow crate label Dec 18, 2025
Copy link
Contributor

@mhilton mhilton left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand why this change is necessary. Slicing a (Large)StringArray doesn't change the data buffer, so the offsets into the data buffer also do not change. Do you have an example of a StringArray where the existing code produces incorrect results?

@UtkarshSahay123
Copy link
Contributor Author

UtkarshSahay123 commented Dec 18, 2025 via email

@alamb
Copy link
Contributor

alamb commented Dec 19, 2025

That said, I agree that this distinction is subtle, and without a concrete
example where the current implementation produces incorrect results, the
change may not be justified. I will try to construct a minimal reproducer or
add a targeted test demonstrating this behavior. If I’m unable to do so, I’m
happy to drop or revise the change.

This sounds like a good plan

@alamb alamb marked this pull request as draft December 19, 2025 20:09
@alamb
Copy link
Contributor

alamb commented Dec 19, 2025

Marking as draft as I think this PR is no longer waiting on feedback and I am trying to make it easier to find PRs in need of review. Please mark it as ready for review when it is ready for another look

@UtkarshSahay123 UtkarshSahay123 marked this pull request as ready for review December 20, 2025 16:36
@UtkarshSahay123
Copy link
Contributor Author

Thanks for the note! I’ve marked the PR as ready for review now. Please let me know if any further changes are needed.

@alamb
Copy link
Contributor

alamb commented Dec 23, 2025

I don't think we can accept this PR without a test demonstrating what issue it is fixing

@Jefffrey Jefffrey marked this pull request as draft December 23, 2025 15:43
@UtkarshSahay123 UtkarshSahay123 marked this pull request as ready for review December 24, 2025 17:05
@Jefffrey Jefffrey marked this pull request as draft December 25, 2025 02:05
@Jefffrey
Copy link
Contributor

@UtkarshSahay123 Could you please refrain from marking this PR as ready for review until the above comments have been addressed?

@UtkarshSahay123 UtkarshSahay123 marked this pull request as ready for review December 26, 2025 08:49
Copy link
Contributor

@Jefffrey Jefffrey left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The added test already passes on main without this fix, and this PR removes two existing tests which seem to test a similar scenario. I'm having trouble understanding what fix this PR is providing.

@UtkarshSahay123
Copy link
Contributor Author

Thanks for the review and for pointing this out.

After re-checking, I agree that the added test currently passes on main and does not clearly demonstrate a regression by itself. Removing the existing tests also makes the intent of this change unclear.

I’ll convert this PR back to draft and re-evaluate whether there is a concrete issue being fixed here. If I’m unable to identify a clear failing case on main, I’ll adjust or close this PR accordingly.

Thanks again for taking the time to review.

@UtkarshSahay123 UtkarshSahay123 marked this pull request as draft December 26, 2025 13:49
@UtkarshSahay123
Copy link
Contributor Author

I added a minimal repro test to validate UTF-8 boundary handling on sliced StringArray values.

After testing on upstream main, the boundary validation already behaves correctly and the test passes without any changes. This confirms that there is no actual regression to fix here.

Closing this PR based on the feedback — thanks for the review and guidance.

@alamb
Copy link
Contributor

alamb commented Dec 29, 2025

Thank you @UtkarshSahay123

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

arrow Changes to the arrow crate

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants