Incomplete grapheme segmentation #6931
Replies: 4 comments 4 replies
-
|
Hi, I'm not exactly an expert on this topic but I can at least address a few points from a cursory search:
Overall, Ghostty's support for Indic scripts is very incomplete (see #5637), mostly because nobody on the team can read and write an Indic language. In fact, it's more accurate to say that the vast majority of the team speaks only US English and nothing else. We do have better CJK support since I speak Chinese and Mitchell speaks some Japanese, but I do realize this is not really ideal in terms of linguistic diversity. Some pointers towards better Indic support would definitely be greatly appreciated.
-
|
Thanks for pointing this out. As @pluiedev noted, I believe our implementation and verification were based on v2.8, which didn't have this. Specifically, I think Ghostty mostly supports Unicode 15.0, with incomplete support for later versions (e.g. we support some of the new glyphs in Unicode 16 in our font sprite, but not all). We previously verified our grapheme break using a verification program that ran through the entire [...]
Let me look into this more. Do you happen to have an exact sequence of codepoints that would fail this in Ghostty?
-
|
On Thu, Mar 27, 2025 at 07:09:53AM -0700, Mitchell Hashimoto wrote:
> Let me look into this more. Do you happen to have an exact sequence of codepoints that would fail this in Ghostty?
Download
https://www.unicode.org/Public/UCD/latest/ucd/auxiliary/GraphemeBreakTest.txt
and use it to test your implementation directly rather than
testing against utf8proc, which isn't up to date with the Unicode spec
anyway; it stops at Unicode 15.1, IIRC.
Indeed, from briefly looking over the algorithm utf8proc implements,
it doesn't *seem* to be spec compliant, but that was only a quick look.
It may be that their implementation is equivalent to the spec
and passes the tests; I can't say without more study. It's certainly not a
direct implementation of the spec and seems to be bolted on after
the fact. They have a grapheme_break_simple and then a
grapheme_break_extended that calls grapheme_break_simple. I guess they
did it this way for historical reasons? For an actual clean
implementation of the spec, look at libunistring instead, though IIRC
that's GPL licensed, so it won't work for you.
Anyway, I will most likely implement full UAX29 segmentation in kitty
(assuming I can get it performant enough)
and also probably develop a kitten that can be used to test terminal
emulator compliance, based on the above test data. You (and other
terminal developers) should be able to use that to check your
implementations as well.
I am rather keen to move the ecosystem to a single standard, as I am
heartily sick of bug reports about width. The correct
solution is, of course, the width part of the kitty text sizing protocol,
but it will be a long time before that sees widespread adoption. In the
meantime, I would like to see terminal emulators converge on a single
algorithm for width and segmentation. There is still the problem that
widths and segmentation properties change with every new version of the
spec, but at least the algorithm should be stable. Hopefully, once there is a
straightforward way to test compliance, the ecosystem will move.
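
For anyone who wants to try the test data mentioned above before a dedicated tool exists, here is a minimal sketch in Python of reading GraphemeBreakTest.txt. It assumes the file's documented line format, where `÷` marks a permitted break, `×` marks a forbidden break, and everything after `#` is a comment. The names `parse_test_line` and `failing_cases`, and the `segment` callback, are hypothetical and not part of kitty, Ghostty, or utf8proc.

```python
def parse_test_line(line):
    """Parse one line of GraphemeBreakTest.txt.

    Returns (codepoints, clusters), where clusters is the expected
    segmentation as a list of lists of code points, or None for blank
    and comment-only lines.
    """
    data = line.split("#", 1)[0].strip()  # drop the trailing comment
    if not data:
        return None
    codepoints, clusters, current = [], [], []
    for token in data.split():
        if token == "÷":        # break allowed here: close the cluster
            if current:
                clusters.append(current)
                current = []
        elif token == "×":      # no break here: keep extending the cluster
            continue
        else:                   # a hexadecimal code point
            cp = int(token, 16)
            codepoints.append(cp)
            current.append(cp)
    if current:
        clusters.append(current)
    return codepoints, clusters


def failing_cases(lines, segment):
    """Return the test lines for which segment(codepoints) != expected.

    `segment` is a hypothetical callback wrapping the implementation under
    test: it takes a list of code points and returns a list of clusters.
    """
    failures = []
    for line in lines:
        parsed = parse_test_line(line)
        if parsed is None:
            continue
        codepoints, expected = parsed
        if segment(codepoints) != expected:
            failures.append(line.strip())
    return failures
```

To run it over the real file, open it with `encoding="utf-8"` and pass the file object as `lines`; an empty result means every test case in that version of the file passed.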
|
-
|
FYI, this is now implemented in kitty along with the promised test kitten and a spec describing the exact algorithm. See kovidgoyal/kitty#8533
-
Hi, I am the developer of kitty, and I was looking into implementing more complete grapheme segmentation in kitty. While researching I discovered that ghostty claims to implement full UAX 29 grapheme clustering, at least as per my reading of: https://mitchellh.com/writing/grapheme-clusters-in-terminals
I was rather surprised to read this claim, since full UAX29 grapheme clustering is dog slow, with a lot of branches or alternately a lookup table approach. So I investigated ghostty's code and came across:
ghostty/src/unicode/grapheme.zig
Line 6 in 1c3693c
I am not familiar with Zig, but if I am reading that right, you are doing segmentation based on a lookup table with just two bits of state + 8 bits for prev and next char grapheme break property. This is impossible. You seem to be completely ignoring Indic conjunct breaks, which are needed for rule GB9c: https://www.unicode.org/reports/tr29/#GB9c
The code claims to be based on utf8proc, but utf8proc supports GB9c: https://github.com/JuliaStrings/utf8proc/blob/a1b99daa2a3393884220264c927a48ba1251a9c6/utf8proc.c#L308
Maybe it was written based on an old utf8proc?
Anyway, my question is, is this intentional? Are you simply ignoring GB9c, or is it an oversight that you intend to correct in the future? It would be good to get to some kind of common algorithm for segmentation across terminals.
If it's an oversight, then I recommend you test your implementation against the grapheme break property test suite from the unicode consortium. That should tease out most issues in your implementation.
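
To make the GB9c point concrete, here is a rough Python sketch of that rule alone: do not break before an InCB=Consonant if it is preceded by a consonant, then any run of Extend or Linker characters containing at least one Linker. The `INCB` table is a tiny hand-picked subset of the Indic_Conjunct_Break property for illustration only, and `gb9c_forbids_break` is a hypothetical name; this is not how Ghostty, kitty, or utf8proc implement it.

```python
# Hand-picked subset of Indic_Conjunct_Break (InCB) values, for illustration.
# A real implementation would generate this from DerivedCoreProperties.txt.
INCB = {
    0x0915: "Consonant",  # DEVANAGARI LETTER KA
    0x0937: "Consonant",  # DEVANAGARI LETTER SSA
    0x094D: "Linker",     # DEVANAGARI SIGN VIRAMA
    0x200D: "Extend",     # ZERO WIDTH JOINER
}


def gb9c_forbids_break(preceding, nxt):
    """Return True if GB9c forbids a break between preceding[-1] and nxt.

    `preceding` is the list of code points before the candidate break
    position and `nxt` is the code point after it.
    """
    if INCB.get(nxt) != "Consonant":
        return False
    seen_linker = False
    # Walk backwards over the Extend/Linker run to find the consonant.
    for cp in reversed(preceding):
        prop = INCB.get(cp)
        if prop == "Linker":
            seen_linker = True
        elif prop == "Extend":
            continue
        elif prop == "Consonant":
            return seen_linker  # only applies if a Linker was in between
        else:
            return False        # any other character ends the pattern
    return False
```

The decision depends on whether a Linker occurred anywhere in a run of arbitrary length, so looking only at the previous and next characters' Grapheme_Cluster_Break values cannot reproduce it; a table-driven implementation has to carry that fact forward as extra state.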