Case folding fixes by stevengj · Pull Request #133 · JuliaStrings/utf8proc

stevengj · 2018-04-30T02:24:37Z

Updated version of #102:

Restores the original behavior of UTF8PROC_IGNORE so that this PR is non-breaking, and adds a new UTF8PROC_STRIPNA flag.
Renames the new function to utf8proc_NFKC_Casefold instead of utf8proc_NFKC_CF.
Adds a minimal test.
Updates the utf8proc_data.c file.

To do:

Compare the result of UTF8PROC_CASEFOLD before and after this PR to make sure any changes are in the right direction. No differences found.

* Only include C (Common) and F (Full) foldings from CaseFolding.txt. Removed S (Simple) since F & S are specified to be exclusive.
* Extend UTF8PROC_IGNORE to also ignore unassigned codepoints (such as \u2065) which are specified as being discarded by NFKC_CF.
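The first item above (keeping only C and F entries) can be sketched as a small filter over CaseFolding.txt lines. This is an illustrative sketch, not the data generator used in this PR: the names `fold_entry`, `parse_fold_line`, and `keep_for_full_folding` are hypothetical, and the parser assumes the usual `<code>; <status>; <mapping>; # <name>` line layout.

```c
#include <stdio.h>

/* One parsed CaseFolding.txt entry (hypothetical struct for this sketch). */
typedef struct {
    unsigned long code;       /* source code point */
    char status;              /* 'C', 'F', 'S', or 'T' */
    unsigned long mapping[3]; /* folded code points (full foldings use up to 3) */
    int mapping_len;          /* number of code points in mapping */
} fold_entry;

/* Parse one data line. Returns 1 on success and 0 for comment,
 * blank, or malformed lines. */
int parse_fold_line(const char *line, fold_entry *out) {
    char mapping_text[64];
    if (sscanf(line, "%lx; %c; %63[^;];",
               &out->code, &out->status, mapping_text) != 3)
        return 0;
    out->mapping_len = sscanf(mapping_text, "%lx %lx %lx",
                              &out->mapping[0], &out->mapping[1],
                              &out->mapping[2]);
    return out->mapping_len > 0;
}

/* Full case folding uses the C and F entries only. S entries are the
 * simple alternatives to F entries, so taking both would give two
 * mappings for the same code point. T (Turkic) entries are also skipped. */
int keep_for_full_folding(const fold_entry *e) {
    return e->status == 'C' || e->status == 'F';
}
```

For example, `00DF; F; 0073 0073; # LATIN SMALL LETTER SHARP S` is kept with a two-code-point mapping, while `1E9E; S; 00DF; # LATIN CAPITAL LETTER SHARP S` is dropped because the F entry for U+1E9E already covers it.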


stevengj · 2018-04-30T02:35:48Z

@nomoon, you wrote in #102 that unassigned codepoints are "specified as being discarded by NFKC_CF", but I can find no such specification.

In section 3.13 (Default Case Algorithms) of the Unicode specification, it says:

A modified form of Default Case Folding is designed for best behavior when doing caseless matching of strings interpreted as identifiers. This folding is based on Case_Folding(C), but also removes any characters which have the Unicode property value Default_Ignorable_Code_Point=True. It also maps characters to their NFKC equivalent sequences. Once the mapping for a string is complete, the resulting string is then normalized to NFC. That last normalization step simplifies the statement of the use of this folding for caseless matching.

Section 5.21 says:

The default ignorable code points are listed in DerivedCoreProperties.txt in the Unicode Character Database with the property Default_Ignorable_Code_Point.

and it doesn't seem like unassigned codepoints should be treated as ignorable.
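To make the removal step in the quoted definition concrete, here is a minimal sketch that strips code points with Default_Ignorable_Code_Point=True from an array of code points. It covers only that one step (not case folding or normalization), and the range table is a tiny hand-copied subset for illustration; a real implementation would use the full list in DerivedCoreProperties.txt. The names `cp_range`, `ignorable_subset`, `is_default_ignorable_subset`, and `strip_default_ignorable` are hypothetical.

```c
#include <stddef.h>

/* Inclusive range of code points (hypothetical type for this sketch). */
typedef struct { unsigned long first, last; } cp_range;

/* ASSUMPTION: a small illustrative subset only. The authoritative
 * list is the Default_Ignorable_Code_Point section of
 * DerivedCoreProperties.txt. */
const cp_range ignorable_subset[] = {
    {0x00AD, 0x00AD}, /* SOFT HYPHEN */
    {0x200B, 0x200F}, /* ZERO WIDTH SPACE..RIGHT-TO-LEFT MARK */
};

/* Returns 1 if cp falls in one of the ranges in the subset table. */
int is_default_ignorable_subset(unsigned long cp) {
    size_t count = sizeof ignorable_subset / sizeof ignorable_subset[0];
    for (size_t i = 0; i < count; i++) {
        if (cp >= ignorable_subset[i].first && cp <= ignorable_subset[i].last)
            return 1;
    }
    return 0;
}

/* Removes listed code points in place and returns the new length.
 * Membership is decided by the property table alone, so a code point
 * that is merely unassigned is left untouched. */
size_t strip_default_ignorable(unsigned long *cps, size_t n) {
    size_t kept = 0;
    for (size_t i = 0; i < n; i++) {
        if (!is_default_ignorable_subset(cps[i]))
            cps[kept++] = cps[i];
    }
    return kept;
}
```

The point of the sketch is that the decision is a property lookup: whether a code point is assigned plays no part in it.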

Do you have any reference to the contrary? If not, I will remove the STRIPNA flag from NFKC_Casefold (but I will leave the flag in the API, since some people may want this transformation).

nomoon · 2018-04-30T17:53:16Z

@stevengj It's been a long time since I wrote the PR so I'm not sure where that came from. I'll look if I have the chance. In any event, it would be good to have the option available so as not to have to scrub the string of invalid codepoints as a separate step, since many use-cases of the NFKC_Casefold would a) assume that the string is valid, and b) possibly not properly case-fold if confused by invalid points.

stevengj · 2018-04-30T18:31:27Z

(Note that unassigned != invalid.)
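The distinction can be sketched in code. Validity is a fixed range check on the code point value, whereas "unassigned" depends on the Unicode Character Database for a given Unicode version (General_Category Cn) and so needs a table lookup. The function name `is_valid_scalar` is hypothetical and only the validity half is shown.

```c
#include <stdbool.h>

/* A code point is a valid Unicode scalar value if it lies in
 * 0..0x10FFFF and is not a surrogate (0xD800..0xDFFF). This says
 * nothing about whether a character is assigned to it: that needs
 * a lookup in the character database, which is not done here. */
bool is_valid_scalar(unsigned long cp) {
    if (cp > 0x10FFFF)
        return false;              /* beyond the Unicode code space */
    if (cp >= 0xD800 && cp <= 0xDFFF)
        return false;              /* surrogates are not scalar values */
    return true;
}
```

Under this check, U+2065 (the unassigned example from #102) is a valid code point, while 0xD800 and 0x110000 are not.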

nomoon · 2018-04-30T21:49:58Z

@stevengj Of course. My bad. But yeah either way I can't find where I read that (possibly mis-read the ICU documentation or code).

nomoon and others added 7 commits April 29, 2018 21:43

Document the changes to UTF8PROC_IGNORE in header.

8e317d0

Add NFKC_CF helper function with documentation.

a96c9b4

restore old IGNORE behavior, add UTF8PROC_STRIPNA, rename to utf8proc_NFKC_Casefold, add a test

edff036

success message

f27c5a9

test that IGNORE does not strip NA

144f90d

data update

56f3b1b

stevengj mentioned this pull request Apr 30, 2018

Fixes allowing for “Full” folding and NFKC_CaseFold compliance. #102

Closed

NFKC_Casefold shouldn't strip NA

da72c45

stevengj merged commit bdc8b9e into master May 2, 2018

stevengj deleted the case_folding_fixes_new branch May 2, 2018 12:15

stevengj merged 8 commits into master from case_folding_fixes_new
