Incomplete grapheme segmentation #6931

kovidgoyal · 2025-03-27T08:10:01Z

kovidgoyal
Mar 27, 2025

Hi, I am the developer of kitty, and I was looking into implementing more complete grapheme segmentation in kitty. While researching I discovered that ghostty claims to implement full UAX 29 grapheme clustering, at least as per my reading of: https://mitchellh.com/writing/grapheme-clusters-in-terminals

I was rather surprised to read this claim, since full UAX 29 grapheme clustering is dog slow: it needs either a lot of branches or a lookup-table approach. So I investigated ghostty's code and came across:

ghostty/src/unicode/grapheme.zig

Line 6 in 1c3693c

/// Determines if there is a grapheme break between two codepoints. This

I am not familiar with Zig, but if I am reading that right, you are doing segmentation based on a lookup table with just two bits of state plus 8 bits for the grapheme break properties of the previous and next characters. That cannot be enough for the full algorithm. You seem to be completely ignoring Indic conjunct breaks, which are needed for rule GB9c: https://www.unicode.org/reports/tr29/#GB9c
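For readers unfamiliar with GB9c, here is a minimal Python sketch of the rule as stated in UAX 29: no break before a consonant when it is preceded by a consonant followed by a run of Extend/Linker characters that contains at least one Linker. The `incb` table is a hand-picked, hypothetical subset covering a few Devanagari code points only; a real implementation would generate it from the InCB values in DerivedCoreProperties.txt. It is not ghostty's or kitty's code.

```python
# Hypothetical, hand-picked subset of Indic_Conjunct_Break (InCB) values.
# A real table would be generated from DerivedCoreProperties.txt.
CONSONANT, LINKER, EXTEND, NONE = "Consonant", "Linker", "Extend", "None"


def incb(cp: int) -> str:
    """Return the InCB value for a code point (illustrative subset only)."""
    if 0x0915 <= cp <= 0x0939:      # DEVANAGARI LETTER KA..HA
        return CONSONANT
    if cp == 0x094D:                # DEVANAGARI SIGN VIRAMA
        return LINKER
    if cp in (0x093C, 0x200D):      # DEVANAGARI SIGN NUKTA, ZERO WIDTH JOINER
        return EXTEND
    return NONE


def gb9c_no_break_before(cps: list[int], i: int) -> bool:
    """True if GB9c forbids a break between cps[i - 1] and cps[i]."""
    if i <= 0 or i >= len(cps) or incb(cps[i]) != CONSONANT:
        return False
    seen_linker = False
    j = i - 1
    # Walk back over the run of Extend/Linker characters.
    while j >= 0 and incb(cps[j]) in (LINKER, EXTEND):
        seen_linker = seen_linker or incb(cps[j]) == LINKER
        j -= 1
    # The run must contain a Linker and be preceded by a Consonant.
    return seen_linker and j >= 0 and incb(cps[j]) == CONSONANT
```

The point of the sketch is that the decision depends on more than the two adjacent code points: the run between the two consonants can be arbitrarily long, so the segmenter has to carry that context in some form.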

The code claims to be based on utf8proc, but utf8proc supports GB9c: https://github.com/JuliaStrings/utf8proc/blob/a1b99daa2a3393884220264c927a48ba1251a9c6/utf8proc.c#L308

Maybe it was written based on an old utf8proc?

Anyway, my question is: is this intentional? Are you simply ignoring GB9c, or is it an oversight that you intend to correct in the future? It would be good to get to some kind of common algorithm for segmentation across terminals.

If it's an oversight, then I recommend you test your implementation against the grapheme break property test suite from the unicode consortium. That should tease out most issues in your implementation.

pluiedev · 2025-03-27T08:31:02Z

pluiedev
Mar 27, 2025
Collaborator

Hi, I'm not exactly an expert on this topic but I can at least address a few points from a cursory search:

We haven't been directly using utf8proc since 132fbb3 — if you grep for utf8proc the only use for it is within pkg/.
It looks like utf8proc v2.8.0 (which is presumably the version the Zig port is based on, as it is the version found within pkg/) does not have any explicit logic related to GB9c. It was rectified in v2.9.0 (JuliaStrings/utf8proc@46a442b) but we appear to not have ported that logic over yet.

Overall Ghostty's support for Indic scripts is very incomplete (see #5637), mostly because nobody on the team can read and write in an Indic language — in fact it's more accurate to say that the vast majority of the team only speaks US English and nothing else. We do have better CJK support since I speak Chinese and Mitchell speaks some Japanese, but I do realize this is not really ideal in terms of linguistic diversity. Some pointers towards better Indic support would definitely be greatly appreciated.

0 replies

mitchellh · 2025-03-27T14:09:32Z

mitchellh
Mar 27, 2025
Maintainer

Thanks for pointing this out. As @pluiedev noted, I believe our implementation and verification were based on v2.8, which didn't have this. Specifically, I think Ghostty mostly supports Unicode 15.0 with incomplete support for later versions (e.g. we support some of the new glyphs in Unicode 16 in our font sprite but not all).

We previously verified our grapheme break using a verification program that ran through the entire u21 x u21 state space (all that was necessary at the time) and compared our implementation against utf8proc. This passed then. It would probably fail now, but I should probably revive that to verify with later versions of Unicode.
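The kind of pairwise differential check described above can be sketched roughly as follows in Python. `impl` and `reference` are hypothetical stand-ins for the implementation under test and a trusted reference (such as a binding to utf8proc); neither name comes from Ghostty. The toy predicates at the bottom exist only to make the sketch runnable.

```python
from itertools import product
from typing import Callable, Iterator

# A break predicate: is there a grapheme break between code points a and b?
BreakFn = Callable[[int, int], bool]


def mismatches(impl: BreakFn, reference: BreakFn,
               codepoints: range) -> Iterator[tuple[int, int]]:
    """Yield every (a, b) pair on which the two predicates disagree."""
    for a, b in product(codepoints, repeat=2):
        if impl(a, b) != reference(a, b):
            yield a, b


# Toy predicates for illustration only: break unless b is a combining mark
# in U+0300..U+036F. The "implementation" deliberately forgets U+0301.
def toy_reference(a: int, b: int) -> bool:
    return not (0x0300 <= b <= 0x036F)


def toy_impl(a: int, b: int) -> bool:
    return not (0x0300 <= b <= 0x036F and b != 0x0301)


bad_pairs = list(mismatches(toy_impl, toy_reference, range(0x02FE, 0x0304)))
```

Run over the full 21-bit range this is 2^42 (about 4.4 trillion) pairs, so in practice the loop would live in compiled code rather than Python.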

Let me look into this more. Do you happen to have an exact sequence of codepoints that would fail this in Ghostty?

0 replies

kovidgoyal · 2025-03-27T15:15:14Z

kovidgoyal
Mar 27, 2025
Author

On Thu, Mar 27, 2025 at 07:09:53AM -0700, Mitchell Hashimoto wrote:
> Let me look into this more. Do you happen to have an exact sequence of codepoints that would fail this in Ghostty?

Download https://www.unicode.org/Public/UCD/latest/ucd/auxiliary/GraphemeBreakTest.txt and use it to test your implementation directly rather than testing against utf8proc, which isn't up to date with the Unicode spec anyway; it stops at Unicode 15.1 IIRC. Indeed, from briefly looking over the algorithm utf8proc implements, it doesn't *seem* to be spec compliant, but that was only a quick look. It may be that their implementation is equivalent to the spec and passes the tests; I can't say without more study. It's certainly not a direct implementation of the spec and seems to be sort of bolted on after the fact. They have a simple_grapheme_break and then a grapheme_break_extended that calls simple_grapheme_break. I guess they did it this way for historical reasons? For an actual clean implementation of the spec, look at libunistring instead, though IIRC that's GPL licensed so it won't work for you.

Anyway, I will most likely implement full UAX 29 segmentation in kitty (assuming I can get it performant enough) and also probably develop a kitten that can be used to test terminal emulator compliance, based on the above test data. You (and other terminal developers) should be able to use that to check your implementation as well.

I am rather keen to move the ecosystem to a single standard, as I am heartily sick of bug reports about width. The correct solution is, of course, the width part of the kitty text sizing protocol, but it will be a long time before that sees widespread adoption. In the meantime I would like to see terminal emulators converge on a single algorithm for width and segmentation. There is still the problem that widths and segmentation properties change with every new version of the spec, but at least the algorithm should be stable. Hopefully, once there is a straightforward way to test compliance, the ecosystem will move.
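As a rough illustration of testing directly against the GraphemeBreakTest.txt file linked above, here is a small Python sketch that parses its line format (`÷` marks a break, `×` marks no break, code points are hex, and everything after `#` is a comment). The `segment` callback is a hypothetical hook for whatever implementation is under test; it is assumed to return one boolean per boundary position, including start and end of text.

```python
from typing import Callable, Iterable, Iterator, Optional

BREAK, NO_BREAK = "\u00f7", "\u00d7"  # the ÷ and × markers used by the file


def parse_test_line(line: str) -> Optional[tuple[list[int], list[bool]]]:
    """Parse one line into (code points, break flags), or None if no data.

    flags[i] is True when a break is expected before codepoints[i];
    the final flag describes the end of the text.
    """
    data = line.split("#", 1)[0].strip()
    if not data:
        return None
    tokens = data.split()
    marks, hexes = tokens[0::2], tokens[1::2]
    if len(marks) != len(hexes) + 1 or any(m not in (BREAK, NO_BREAK) for m in marks):
        raise ValueError(f"malformed test line: {line!r}")
    return [int(h, 16) for h in hexes], [m == BREAK for m in marks]


def failures(lines: Iterable[str],
             segment: Callable[[list[int]], list[bool]]
             ) -> Iterator[tuple[int, list[int], list[bool]]]:
    """Yield (line number, code points, expected flags) for each failed case."""
    for number, line in enumerate(lines, 1):
        parsed = parse_test_line(line)
        if parsed is None:
            continue
        codepoints, expected = parsed
        if segment(codepoints) != expected:
            yield number, codepoints, expected
```

Typical use would be to open the downloaded file with `encoding="utf-8"` and pass the file object as `lines`; an empty result means every case in that version of the test data passed.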

1 reply

kovidgoyal Mar 27, 2025
Author

Sorry got the link to the test data wrong:
https://www.unicode.org/Public/UCD/latest/ucd/auxiliary/GraphemeBreakTest.txt

kovidgoyal · 2025-04-12T09:16:53Z

kovidgoyal
Apr 12, 2025
Author

FYI, this is now implemented in kitty along with the promised test kitten and a spec describing the exact algorithm. See kovidgoyal/kitty#8533

3 replies

00-kat Apr 12, 2025
Collaborator

I'm going to leave this Discussion as open if you don't mind, since presumably it would be desirable for Ghostty to fix the issue too.

kovidgoyal Apr 12, 2025
Author

Yes, of course, it's just closed as far as I am concerned, since my question was answered.

mitchellh Apr 12, 2025
Maintainer

Indeed. Thanks Kovid. We'll get our Unicode implementation updated as soon as we build up a better test harness to verify updates beyond Unicode 15.
