Skip to content

Add UTF-16/UTF-8 byte offset conversion utilities#10873

Open
tilladam wants to merge 4 commits intoslint-ui:masterfrom
tilladam:unicode-utils
Open

Add UTF-16/UTF-8 byte offset conversion utilities#10873
tilladam wants to merge 4 commits intoslint-ui:masterfrom
tilladam:unicode-utils

Conversation

@tilladam
Copy link
Contributor

Summary

  • Adds unicode_utils module to i-slint-core consolidating duplicate UTF-16/UTF-8 offset conversion code
  • Replaces inline implementations in android-activity/javahelper.rs, qt/qt_window.rs, and core/items/text.rs with calls to the shared module
  • Byte offset boundary helpers (ceil_byte_offset, floor_byte_offset) replace a hand-rolled floor_char_boundary equivalent (stdlib version requires Rust 1.91, MSRV is 1.88)

Functions

Byte offset utilities:

  • is_valid_byte_offset — check if offset is on a UTF-8 character boundary
  • floor_byte_offset / ceil_byte_offset — snap to nearest valid boundary
  • byte_offset_to_char_count / char_count_to_byte_offset — character count conversions

UTF-16 conversions (needed for Android InputConnection + iOS UITextInput):

  • utf16_offset_to_byte_offset — UTF-16 → UTF-8 (None for invalid mid-surrogate offsets)
  • byte_offset_to_utf16_offset — UTF-8 → UTF-16
  • utf16_offset_to_byte_offset_clamped — UTF-16 → UTF-8 with forward clamping (matches original convert_utf16_index_to_utf8 semantics)

Test plan

  • 39 unit tests + 3 doctests covering ASCII, BMP, supplementary plane (emoji), combining characters, surrogate pair edge cases, empty strings, roundtrip conversions
  • cargo build for i-slint-core, i-slint-backend-qt, i-slint-backend-android-activity
  • cargo test -p i-slint-core unicode_utils

Split out from #10557 per review feedback to use smaller increments.

tilladam and others added 2 commits February 26, 2026 15:24
Add unicode_utils module to i-slint-core with utility functions for
converting between UTF-8 byte offsets and UTF-16 code unit offsets,
and for snapping byte offsets to character boundaries.

Replace duplicate inline implementations in the Android backend
(javahelper.rs), Qt backend (qt_window.rs), and core text handling
(text.rs) with calls to the shared module.
Copy link
Member

@tronical tronical left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm generally in favour of de-duplicating code when there's value - in terms of improving readability or unifying complexity. I have the feeling that the LSP might have similar code duplicated.

That said, I prefer introducing code when it's used.

So here the android and qt backend could benefit from a byte_offset_to_utf16_offset shared function, because that's shared. I even agree :-) that the expression here becomes more readable, i.e. byte_offset_to_utf16_offset is better than in_str[..utf8_index].encode_utf16().count(). We could bike-shed if we should call them indices or offsets, but I'm fine either way :-).

Also, while qt and android backends use different versions of the inverse (clamped vs. non-clamped), they could be seen as "siblings" and I think that's also fine to share in this PR.

However, for the other functions I'd prefer them to be introduced with changes that use them, so it's more apparent what the value is.

/// ```
pub fn byte_offset_to_utf16_offset(text: &str, byte_offset: usize) -> usize {
assert!(
is_valid_byte_offset(text, byte_offset),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is - as far as I can tell - the only call site for a function where I feel the one-liner it is (offset <= text.len() && text.is_char_boundary(offset)) is more readable than is_valid_byte_offset().

I'd prefer to inline here.

However: Why is a panic here better than what I suspect would be "rounding up" behaviour?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code wise this seems "resolved" in the sense of inlined, but I'm curious about what you think about my panic question :)

/// assert_eq!(utf16_offset_to_byte_offset("a😀b", 2), None); // inside surrogate pair
/// assert_eq!(utf16_offset_to_byte_offset("a😀b", 3), Some(5));
/// ```
pub fn utf16_offset_to_byte_offset(text: &str, utf16_offset: usize) -> Option<usize> {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's no call site for this function yet. I'd prefer to introduce this function in a PR that uses the code.

Add unicode_utils module to i-slint-core with utility functions for
converting between UTF-8 byte offsets and UTF-16 code unit offsets,
and for snapping byte offsets to character boundaries.

Replace duplicate inline implementations in the Android backend
(javahelper.rs), Qt backend (qt_window.rs), and core text handling
(text.rs) with calls to the shared module.

floor_byte_offset / ceil_byte_offset are polyfills for
str::floor_char_boundary / str::ceil_char_boundary (stabilized in
Rust 1.91, MSRV is currently 1.88).
Copy link
Member

@tronical tronical left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

AFAICS there are two unresolved comments :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants