Update data tables to Unicode 7.0.0 #9

Merged: stevengj merged 8 commits into master from cjh/markdata on Jul 18, 2014

@jiahao (Collaborator) commented Jul 18, 2014

Updates:

  1. Updates the data_generator.rb script. It now runs on modern versions of Ruby (> 1.8), and the hard-coded data tables are replaced with reads from the appropriate Unicode data (UNIDATA) files.
  2. Adds a new Makefile target, update, which downloads the relevant UNIDATA files and runs data_generator.rb to produce utf8proc_data.c.new.
  3. Updates utf8proc_data.c to the output of make update run against UNIDATA v7.0.0.
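
To illustrate what reading the tables from UNIDATA files involves, here is a minimal sketch of parsing one record of UnicodeData.txt, whose lines are semicolon-separated fields. This is not the code in data_generator.rb; the struct and method names (UnicodeRecord, parse_unicode_data_line, load_unicode_data) are hypothetical, and only a few of the fields are kept.

```ruby
# Hypothetical sketch, not the actual data_generator.rb code.
# Field positions follow the documented UnicodeData.txt layout:
# 0 = code point (hex), 1 = name, 2 = general category,
# 3 = canonical combining class, 5 = decomposition mapping.
UnicodeRecord = Struct.new(:code, :name, :category,
                           :combining_class, :decomposition)

def parse_unicode_data_line(line)
  # A limit of -1 keeps trailing empty fields instead of dropping them.
  fields = line.chomp.split(';', -1)
  UnicodeRecord.new(
    fields[0].hex,                       # "00C0" becomes 192
    fields[1],
    fields[2],
    fields[3].to_i,
    fields[5].empty? ? nil : fields[5]   # nil when there is no decomposition
  )
end

# Reads every record from a UnicodeData.txt file at the given path.
def load_unicode_data(path)
  File.foreach(path).map { |line| parse_unicode_data_line(line) }
end
```

For example, calling parse_unicode_data_line on the record for U+00C0 gives a struct whose category is "Lu" and whose decomposition is "0041 0300".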

Observations:

  1. utf8proc.c contains #defined constants that may in principle have changed between v5.0 and v7.0, such as those marking the locations of Hangul, Unihan, etc. I haven't checked them, and it's probably not worth recomputing them for each new Unicode version.
  2. utf8proc implements an internal processing mode called LUMP, briefly described in lump.txt. As far as I can tell, this is a custom normalization mode that is not part of the Unicode standard, but I think we'll want to use it.
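
As an illustration of the kind of constants meant in observation 1, here is a minimal sketch of the Hangul syllable decomposition arithmetic from the Unicode Standard. The values are the standard's published ones, and to my knowledge the Hangul syllable block has not moved between versions; the constant and method names are hypothetical and are not necessarily those used in utf8proc.c.

```ruby
# Constants from the Unicode Standard's Hangul syllable algorithm.
# Names are illustrative, not necessarily those in utf8proc.c.
S_BASE  = 0xAC00   # first precomposed Hangul syllable
L_BASE  = 0x1100   # first leading consonant
V_BASE  = 0x1161   # first vowel
T_BASE  = 0x11A7   # one before the first trailing consonant
L_COUNT = 19
V_COUNT = 21
T_COUNT = 28
N_COUNT = V_COUNT * T_COUNT   # 588 syllables per leading consonant
S_COUNT = L_COUNT * N_COUNT   # 11172 precomposed syllables

# Returns the jamo code points for a precomposed syllable,
# or [cp] unchanged when cp is outside the syllable block.
def decompose_hangul(cp)
  s_index = cp - S_BASE
  return [cp] unless (0...S_COUNT).cover?(s_index)

  l = L_BASE + s_index / N_COUNT
  v = V_BASE + (s_index % N_COUNT) / T_COUNT
  t_index = s_index % T_COUNT
  t_index.zero? ? [l, v] : [l, v, T_BASE + t_index]
end
```

For example, decompose_hangul(0xD55C) returns [0x1112, 0x1161, 0x11AB], and a non-Hangul code point such as 0x41 is returned unchanged.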

Closes #1

@stevengj (Member) commented

This is great!

As a sanity check, if you run on the Unicode 5.0.0 files then does it reproduce the old utf8proc_data.c?

@stevengj (Member) commented

And yes, LUMP is a custom normalization of utf8proc, which we should keep as-is for API compatibility.

@jiahao (Collaborator, Author) commented Jul 18, 2014

> As a sanity check, if you run on the Unicode 5.0.0 files then does it reproduce the old utf8proc_data.c?

Yes, see #8
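
A check like this could be scripted roughly as follows. This is only a sketch under assumptions: the helper names (same_bytes?, differing_lines) are hypothetical, and it says nothing about how the comparison in #8 was actually done.

```ruby
# Hypothetical helper: true when two files are byte-for-byte identical.
def same_bytes?(path_a, path_b)
  File.binread(path_a) == File.binread(path_b)
end

# Hypothetical helper: 1-based numbers of the lines where the files differ,
# including lines present in only one of them.
def differing_lines(path_a, path_b)
  a = File.readlines(path_a)
  b = File.readlines(path_b)
  (0...[a.size, b.size].max).reject { |i| a[i] == b[i] }.map { |i| i + 1 }
end

# Example use, assuming both files exist in the working directory:
#   same_bytes?("utf8proc_data.c", "utf8proc_data.c.new")
#   differing_lines("utf8proc_data.c", "utf8proc_data.c.new")
```

The second helper is there because a plain yes/no answer is not much help when the regenerated tables differ in only a few entries.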

stevengj added a commit that referenced this pull request Jul 18, 2014
Update data tables to Unicode 7.0.0
@stevengj stevengj merged commit a5aeb49 into master Jul 18, 2014
@jiahao jiahao deleted the cjh/markdata branch July 18, 2014 17:41
@PallHaraldsson PallHaraldsson mentioned this pull request Oct 24, 2023