Add UTF-16/UTF-8 byte offset conversion utilities#10873
Add UTF-16/UTF-8 byte offset conversion utilities#10873tilladam wants to merge 4 commits intoslint-ui:masterfrom
Conversation
Add unicode_utils module to i-slint-core with utility functions for converting between UTF-8 byte offsets and UTF-16 code unit offsets, and for snapping byte offsets to character boundaries. Replace duplicate inline implementations in the Android backend (javahelper.rs), Qt backend (qt_window.rs), and core text handling (text.rs) with calls to the shared module.
tronical
left a comment
There was a problem hiding this comment.
I'm generally in favour of de-duplicating code when there's value - in terms of improving readability or unifying complexity. I have the feeling that the LSP might have similar code duplicated.
That said, I prefer introducing code when it's used.
So here the android and qt backend could benefit from a byte_offset_to_utf16_offset shared function, because that's shared. I even agree :-) that the expression here becomes more readable, i.e. byte_offset_to_utf16_offset is better than in_str[..utf8_index].encode_utf16().count(). We could bike-shed if we should call them indices or offsets, but I'm fine either way :-).
Also, while qt and android backends use different versions of the inverse (clamped vs. non-clamped), they could be seen as "siblings" and I think that's also fine to share in this PR.
However, for the other functions I'd prefer them to be introduced with changes that use them, so it's more apparent what the value is.
internal/core/unicode_utils.rs
Outdated
| /// ``` | ||
| pub fn byte_offset_to_utf16_offset(text: &str, byte_offset: usize) -> usize { | ||
| assert!( | ||
| is_valid_byte_offset(text, byte_offset), |
There was a problem hiding this comment.
This is - as far as I can tell - the only call site for a function where I feel the one-liner it is (offset <= text.len() && text.is_char_boundary(offset)) is more readable than is_valid_byte_offset().
I'd prefer to inline here.
However: Why is a panic here better than what I suspect would be "rounding up" behaviour?
There was a problem hiding this comment.
Code wise this seems "resolved" in the sense of inlined, but I'm curious about what you think about my panic question :)
| /// assert_eq!(utf16_offset_to_byte_offset("a😀b", 2), None); // inside surrogate pair | ||
| /// assert_eq!(utf16_offset_to_byte_offset("a😀b", 3), Some(5)); | ||
| /// ``` | ||
| pub fn utf16_offset_to_byte_offset(text: &str, utf16_offset: usize) -> Option<usize> { |
There was a problem hiding this comment.
There's no call site for this function yet. I'd prefer to introduce this function in a PR that uses the code.
f7d2453 to
d6d9a47
Compare
Add unicode_utils module to i-slint-core with utility functions for converting between UTF-8 byte offsets and UTF-16 code unit offsets, and for snapping byte offsets to character boundaries. Replace duplicate inline implementations in the Android backend (javahelper.rs), Qt backend (qt_window.rs), and core text handling (text.rs) with calls to the shared module. floor_byte_offset / ceil_byte_offset are polyfills for str::floor_char_boundary / str::ceil_char_boundary (stabilized in Rust 1.91, MSRV is currently 1.88).
017e104 to
6fa0195
Compare
tronical
left a comment
There was a problem hiding this comment.
AFAICS there are two unresolved comments :)
Summary
unicode_utilsmodule toi-slint-coreconsolidating duplicate UTF-16/UTF-8 offset conversion codeandroid-activity/javahelper.rs,qt/qt_window.rs, andcore/items/text.rswith calls to the shared moduleceil_byte_offset,floor_byte_offset) replace a hand-rolledfloor_char_boundaryequivalent (stdlib version requires Rust 1.91, MSRV is 1.88)Functions
Byte offset utilities:
is_valid_byte_offset— check if offset is on a UTF-8 character boundaryfloor_byte_offset/ceil_byte_offset— snap to nearest valid boundarybyte_offset_to_char_count/char_count_to_byte_offset— character count conversionsUTF-16 conversions (needed for Android InputConnection + iOS UITextInput):
utf16_offset_to_byte_offset— UTF-16 → UTF-8 (Nonefor invalid mid-surrogate offsets)byte_offset_to_utf16_offset— UTF-8 → UTF-16utf16_offset_to_byte_offset_clamped— UTF-16 → UTF-8 with forward clamping (matches originalconvert_utf16_index_to_utf8semantics)Test plan
cargo buildfor i-slint-core, i-slint-backend-qt, i-slint-backend-android-activitycargo test -p i-slint-core unicode_utilsSplit out from #10557 per review feedback to use smaller increments.