* Only include C (Common) and F (Full) foldings from CaseFolding.txt. Removed S (Simple) since F & S are specified to be exclusive.
* Extend UTF8PROC_IGNORE to also ignore unassigned codepoints (such as \u2065) which are specified as being discarded by NFKC_CF.
…_NFKC_Casefold, add a test
|
@nomoon, you wrote in #102 that unassigned codepoints are "specified as being discarded by NFKC_CF", but I can find no such specification. In section 3.13 (Default Case Algorithms) of the Unicode specification, it says:
And section 5.21 says:
and it doesn't seem like unassigned codepoints should be treated as ignorable. Do you have any reference to the contrary? If not, I will remove the |
|
@stevengj It's been a long time since I wrote the PR so I'm not sure where that came from. I'll look if I have the chance. In any event, it would be good to have the option available so as not to have to scrub the string of invalid codepoints as a separate step, since many use-cases of the NFKC_Casefold would a) assume that the string is valid, and b) possibly not properly case-fold if confused by invalid points. |
|
(Note that unassigned != invalid.) |
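To make that distinction concrete, here is a minimal C sketch. It assumes a hypothetical helper, `is_valid_scalar`, which is not part of utf8proc's API. Validity can be decided from the numeric value alone (inside the code space and not a surrogate), whereas whether a codepoint is assigned depends on the Unicode Character Database version and cannot be computed from the value.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper (an assumption for illustration, not a utf8proc
 * function): true if cp is a Unicode scalar value, i.e. it lies in
 * 0..0x10FFFF and is not a UTF-16 surrogate. */
static bool is_valid_scalar(int32_t cp) {
    if (cp < 0 || cp > 0x10FFFF) {
        return false;               /* outside the Unicode code space */
    }
    if (cp >= 0xD800 && cp <= 0xDFFF) {
        return false;               /* surrogate range: never a valid scalar */
    }
    return true;
}

/* Illustrative checks only. */
static void example_checks(void) {
    /* U+2065, the codepoint cited in this thread as unassigned, is still
     * a valid scalar value. */
    assert(is_valid_scalar(0x2065));
    assert(!is_valid_scalar(0xD800));    /* surrogate: invalid */
    assert(!is_valid_scalar(0x110000));  /* beyond U+10FFFF: invalid */
}
```

A range check like this can reject invalid input, but it says nothing about assignment. Stripping unassigned codepoints needs a lookup in the character property data.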
|
@stevengj Of course. My bad. But yeah either way I can't find where I read that (possibly mis-read the ICU documentation or code). |
Updated version of #102:
* IGNORE so that this PR is non-breaking, adds new STRIPNA flag.
* utf8proc_NFKC_Casefold instead of utf8proc_NFKC_CF
* utf8proc_data.c file.

To do:
* UTF8PROC_CASEFOLD before and after this PR to make sure any changes are in the right direction. No differences found.