Update graphemes for Unicode 7 #20

Merged
stevengj merged 2 commits into master from graphemes
Dec 14, 2014
@stevengj
Member

This fixes #19, and also makes it much easier to implement grapheme iterators in Julia (JuliaLang/julia#9261) by adding a `bool utf8proc_grapheme_break(int32_t c1, int32_t c2)` function to check for a grapheme break between two codepoints. This allows us to iterate over graphemes in-place, without mapping to a separate string with `0xFF` grapheme separators.
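
As a rough illustration of the in-place iteration described above, here is a minimal sketch in C. It does not call utf8proc: `toy_grapheme_break` is a hypothetical stand-in with the same shape as the function named in this comment, and it only knows two rules (CR LF stays together, and a combining diacritical mark attaches to the preceding codepoint). `count_graphemes` is likewise an illustrative name, not part of any library.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a pairwise break test. It covers only two cases:
   no break between CR and LF, and no break before a combining diacritical
   mark (U+0300..U+036F). A real implementation would consult the full
   Unicode grapheme boundary classes instead. */
bool toy_grapheme_break(int32_t c1, int32_t c2) {
    if (c1 == 0x000D && c2 == 0x000A) return false;
    if (c2 >= 0x0300 && c2 <= 0x036F) return false;
    return true;
}

/* Count graphemes in an array of codepoints by testing each adjacent pair.
   The input is only read; no separator-annotated copy is built. */
size_t count_graphemes(const int32_t *cps, size_t n) {
    assert(cps != NULL || n == 0);
    if (n == 0) return 0;
    size_t count = 1;  /* a non-empty sequence has at least one grapheme */
    for (size_t i = 1; i < n; i++) {
        if (toy_grapheme_break(cps[i - 1], cps[i])) count++;
    }
    return count;
}
```

With this sketch, the sequence U+0065 U+0301 U+0078 ("e", combining acute accent, "x") counts as two graphemes, because the only break found is between the accent and "x".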

Unfortunately, I had to break backwards compatibility by changing the `utf8proc_property_t` struct to replace the `extend:1` field with a `boundclass:4` field, where the latter is now read from Unicode's `GraphemeBreakProperty.txt` file by the updated generator script. I took this opportunity to rearrange the struct to put the bitfields at the end, so that C will not insert alignment padding into the struct; as a consequence, the struct actually got smaller by several bytes.
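
To show why moving the bitfields to the end can shrink a struct, here is a small sketch in C. Both structs are hypothetical and much smaller than the real `utf8proc_property_t`; the field names are chosen only to echo the description above. Bitfield layout is implementation-defined, so the exact sizes vary by compiler and ABI. The sketch only checks that the grouped layout is no larger than the interleaved one.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical layout: one-bit flags interleaved with wider fields.
   Each flag tends to occupy its own storage unit, and the following
   int32_t must then be realigned, which wastes space. */
struct props_interleaved {
    int16_t  category;
    unsigned mirrored : 1;
    int32_t  uppercase;
    unsigned extend : 1;
    int32_t  lowercase;
    unsigned ignorable : 1;
};

/* Hypothetical layout: wide fields first, all bitfields grouped at the end
   so they can share a single storage unit. The 4-bit boundclass field
   replaces the 1-bit extend flag, as in the change described above. */
struct props_grouped {
    int32_t  uppercase;
    int32_t  lowercase;
    int16_t  category;
    unsigned mirrored : 1;
    unsigned ignorable : 1;
    unsigned boundclass : 4;
};

/* Compile-time check (C11): grouping must not make the struct larger. */
static_assert(sizeof(struct props_grouped) <= sizeof(struct props_interleaved),
              "grouping bitfields should not increase the struct size");

/* Bytes saved by the grouped layout on the current compiler and ABI. */
size_t bytes_saved_by_grouping(void) {
    return sizeof(struct props_interleaved) - sizeof(struct props_grouped);
}
```

Printing `bytes_saved_by_grouping()` on a given platform shows how much that particular compiler saves, even though the grouped struct carries three more bits of data.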

Once this is merged, I will submit the corresponding patch to the utf8proc folks.

@jiahao, does it look okay to you? cc @StefanKarpinski

@StefanKarpinski
Member

👍

@stevengj
Member Author

Going to go ahead and merge, then submit upstream.

stevengj added a commit that referenced this pull request Dec 14, 2014
Update graphemes for Unicode 7
stevengj merged commit 4f70bbe into master Dec 14, 2014
stevengj mentioned this pull request Mar 8, 2015
stevengj deleted the graphemes branch June 27, 2015 14:07
Development

Successfully merging this pull request may close these issues.

incorrect extended grapheme segmentation