Conversation
- New rules GB10/(12/13) are used to combine emoji-zwj sequences/ (force grapheme breaks every two RI codepoints). Unfortunately this breaks statelessness of grapheme-boundary determination. Deal with this by ignoring the problem in utf8proc_grapheme_break, and by hacking in a special case in decompose - ZWJ moved to its own boundclass, update what is now GB9 accordingly. - Add comments to indicate which rule a given case implements - The Number of bound classes Now exceeds 4 bits, expand to 8 and reorganize fields
utf8proc.c
Outdated
| Please note that evaluation of GB10 (grapheme breaks between emoji zwj sequences) | ||
| and GB 12/13 (regional indicator code points) require knowledge of previous characters | ||
| and are thus not handled by this function. This may result in an incorrect break before | ||
| and E_Modifier class codepoint and an incorrectly missing break between two |
|
This almost certainly needs version number bumps |
utf8proc.c
Outdated
| tbc == UTF8PROC_BOUNDCLASS_EXTEND) | ||
| *last_boundclass = UTF8PROC_BOUNDCLASS_E_BASE; | ||
| else | ||
| *last_boundclass = tbc; |
There was a problem hiding this comment.
Would be nice to export this logic somehow so that we can use it in the Julia grapheme iterator.
There was a problem hiding this comment.
Mhm, I believe just adding an extra out parameter to give you the override class should be fine.
|
Ok, I've updated the API to expose the state override. Since that's an ABI incompatible change, I've bumped the MAJOR version accordingly. Please review. |
| @@ -19,9 +19,9 @@ UCFLAGS = $(CFLAGS) $(PICFLAG) $(C99FLAG) $(WCFLAGS) -DUTF8PROC_EXPORTS | |||
| # not API compatibility: MAJOR should be incremented whenever *binary* | |||
| # compatibility is broken, even if the API is backward-compatible | |||
| # Be sure to also update these in MANIFEST and CMakeLists.txt! | |||
There was a problem hiding this comment.
Did you forget to update CMakeLists.txt too?
There was a problem hiding this comment.
Note that you also need to update utf8proc.h for the API version change.
|
Maybe better to do this first, then update #68? |
utf8proc.h
Outdated
| * matching the rules in Unicode 8.0.0. | ||
| * | ||
| * @warning If the state parameter is used, `utf8proc_grapheme_break` must be called | ||
| * IN ORDER on ALL potentital breaks in a string. |
|
Do we need to hold off on doing the version bump if we want to do #68 right after? |
|
I think we can do a version bump for this (definitely needed), and if we want to merge further big changes we don't need to change the version again until we do the next release. |
|
Ok. |
|
Good to merge? |
|
There is a warning about an unused variable: lbc_override |
|
Oops, thanks. |
|
I also realized I made this static. Will fix. |
| { | ||
| int lbc_override = lbc; | ||
| if (state && *state != UTF8PROC_BOUNDCLASS_START) | ||
| lbc_override = *state; |
There was a problem hiding this comment.
@Keno, I just noticed that lbc_override seems to never be used. Did you mean to pass lbc_override to grapheme_break_simple?
There was a problem hiding this comment.
It's been a while, but yes, that looks right, esp by looking at the first commit in this PR.
There was a problem hiding this comment.
We already seem to have found that bug though: https://github.com/JuliaLang/utf8proc/blob/master/utf8proc.c#L290
I believe the only substantiative changes to the code required are the new rules in TR29, but please do check the Unicode change log (http://www.unicode.org/versions/Unicode9.0.0/) for other changes that may affect this library.