add AbstractChar supertype of Char by stevengj · Pull Request #26286 · JuliaLang/julia

stevengj · 2018-03-01T21:54:14Z

Fixes #25302. Wherever practical, functions that took ::Char arguments are changed to accept any ::AbstractChar, with fallbacks defined as necessary.

For T<:AbstractChar, UInt32(c::T) is defined to return the Unicode codepoint represented by c (or throw an error if c lies outside of Unicode), and T(x::UInt32) should create a T from the Unicode codepoint x (or throw an error … T may represent only a subset of Unicode). This makes it possible to define generic fallbacks for comparison to Char, output, etcetera. Even ancient character sets like EBCDIC have a well-defined injective mapping into Unicode, so I don't think we sacrifice any useful generality this way.
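As a rough illustration of the interface described above, here is a minimal sketch of a one-byte character type. The name `ASCIIChar` and its field are hypothetical. The sketch assumes a Julia version in which the character-to-codepoint direction is supplied by overloading `Base.codepoint` (the function added later in this PR), with `UInt32(c)` and `Char(c)` then provided by the generic fallbacks.

```julia
# Hypothetical one-byte character type, for illustration only.
struct ASCIIChar <: AbstractChar
    byte::UInt8
    # Inner constructor: only a raw UInt8 is accepted here, so every other
    # argument type is routed through the generic AbstractChar constructors.
    function ASCIIChar(b::UInt8)
        b < 0x80 || throw(ArgumentError("byte is outside the ASCII range"))
        return new(b)
    end
end

# T(x::UInt32): build a character from a Unicode codepoint, or throw.
ASCIIChar(u::UInt32) =
    u < 0x80 ? ASCIIChar(u % UInt8) : throw(ArgumentError("codepoint is outside the ASCII range"))

# Character -> Unicode codepoint.
Base.codepoint(c::ASCIIChar) = c.byte

a = ASCIIChar('a')      # Char -> codepoint -> ASCIIChar(::UInt32)
cp = UInt32(a)          # generic fallback built on codepoint
ch = Char(a)            # conversion to the built-in character type
same = (a == 'a')       # fallback comparison goes through Char
```

With only these two definitions, the comparison, conversion, and printing fallbacks discussed in this thread become available for the new type.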

stevengj · 2018-03-01T22:08:20Z

An exception to my comment above about injectively mapping other character sets into Unicode is the Han unification, in which different character variants in non-Unicode Japanese encodings are mapped to the same codepoint in Unicode (which viewed them as "font" variations rather than semantically distinct characters).

If you want to define an AbstractChar type representing TRON or similar, however, I still don't see a real problem. Conversion to UInt32 will still produce a Unicode codepoint for people who need Unicode data. You will presumably need to implement custom methods for comparison of one TRON character to another, I/O to non-Unicode streams, or any other operation where conversion to Unicode would be lossy, but you probably would have implemented those methods anyway in order to avoid the performance cost of going through Char.

Keno · 2018-03-01T23:08:00Z

One thing we had discussed was whether continuing to use UInt32 to represent a unicode codepoint makes sense or whether, with the introduction of AbstractChar, we shouldn't also have Codepoint <: AbstractChar for that.

StefanKarpinski · 2018-03-01T23:18:05Z

CodePoint not Codepoint – it's two words.

Keno · 2018-03-01T23:18:59Z

SuRe ;)

nalimilan · 2018-03-01T23:20:34Z


+# fallback: other AbstractChar types, by default, are assumed
+#           not to support malformed or overlong encodings.
+ismalformed(c::AbstractChar) = false


Is it really a good idea to define these fallbacks? They can be wrong for some types, and they are easy to implement. If you implement your own AbstractChar type, it's good to think about this question precisely.

All of the proposed AbstractChar types I've ever seen, other than Char, don't support malformed characters. So it makes sense to me to make this the default. You can always override it.

Another reason to make this the default is that if ismalformed(c) returns true, you need to define additional methods to decode the character.
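To make the distinction concrete, the following sketch contrasts `Char`, which can hold malformed and overlong data, with a custom type that relies on the fallback. `ByteChar` is a hypothetical type, and `Base.ismalformed` and `Base.isoverlong` are unexported Base functions, so their availability is an assumption about the Julia version in use.

```julia
# Char can carry bytes that are not valid UTF-8.
bad  = "\xff"[1]        # a byte that never appears in valid UTF-8
long = "\xc0\x80"[1]    # overlong two-byte encoding of U+0000

bad_is_malformed  = Base.ismalformed(bad)
long_is_overlong  = Base.isoverlong(long)
plain_is_malformed = Base.ismalformed('a')

# Hypothetical custom type that defines no ismalformed method of its own,
# so the AbstractChar fallback applies.
struct ByteChar <: AbstractChar
    byte::UInt8
    ByteChar(b::UInt8) = new(b)
end

fallback_result = Base.ismalformed(ByteChar(0x41))
```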

nalimilan · 2018-03-01T23:23:15Z

+isless(x::AbstractChar, y::AbstractChar) = isless(Char(x), Char(y))
+==(x::AbstractChar, y::AbstractChar) = Char(x) == Char(y)
+hash(x::AbstractChar, h::UInt) =
+    hash_uint64(((UInt32(x) + UInt64(0xd060fad0)) << 32) ⊻ UInt64(h))


Why not use reinterpret, so that invalid codepoints can be hashed (just like for Char)?

AbstractChar need not be reinterpretable to UInt32.

But you could convert to Char first...

Sure, that's a good point, I guess: since equality is defined in terms of comparison to Char, hashing should match.
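The consequence of keeping hashing consistent with equality can be checked with a small sketch. `SmallChar` is a hypothetical type, and the sketch assumes the fallback `hash` for `AbstractChar` goes through `Char`, as agreed above.

```julia
# Hypothetical one-byte character type.
struct SmallChar <: AbstractChar
    byte::UInt8
    SmallChar(b::UInt8) = new(b)
end
SmallChar(u::UInt32) = SmallChar(UInt8(u))
Base.codepoint(c::SmallChar) = c.byte

x = SmallChar('k')

# Equal values must hash equally, otherwise hashed collections misbehave.
equal_to_char = (x == 'k')
same_hash     = (hash(x) == hash('k'))

# A Dict keyed by Char can then be queried with the custom type.
d = Dict('k' => 1)
found = d[x]
```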

nalimilan · 2018-03-01T23:25:15Z

-  * Only the index of the first code unit of a `Char` is a valid index
-  * The encoding of a `Char` is independent of what precedes or follows it
+  * Each `AbstractChar` in a string is encoded by one or more code units
+  * Only the index of the first code unit of a `AbstractChar` is a valid index


"an". Same below.

nalimilan · 2018-03-01T23:30:06Z

+            "map(f, s::AbstractString) requires f to return AbstractChar; " *
            "try map(f, collect(s)) or a comprehension instead"))
-        write(out, c′::Char)
+        write(out, c′::AbstractChar)


Do we require all AbstractChar implementations to implement write so that it uses UTF-8? This code relies on this assumption.

There is a fallback write method that works via conversion to Char.

OK. But I guess it should be mentioned in the docstring for AbstractChar too? It could sound natural to somebody implementing e.g. a Latin1Char to write it in ISO-8859-1, which wouldn't be correct.

Alternatively, we could define print to always output UTF-8, and write to output a raw encoded value.

But I tend to think that the I/O encoding should be a property of the stream, not the string or character type, because usually all strings in a given stream should use the same encoding.

100% agree: letting the value that's being printed determine the encoding would be a mess. We do need external support for encoded streams that handle this appropriately, but print(io, '∀'), where io is a UTF-16 encoded stream, should definitely not write UTF-8 data to io just because '∀' is a UTF-8 encoded character type. If you have a CharU16 value and a UTF-16 encoded stream, it should know that it can output code units directly, of course, but that's just a matter of providing the right print specializations.

nalimilan · 2018-03-01T23:31:51Z

 If `r` is a function, each occurrence is replaced with `r(s)`
 where `s` is the matched substring (when `pat`is a `Regex` or `AbstractString`) or
-character (when `pat` is a `Char` or a collection of `Char`).
+character (when `pat` is a `AbstractChar` or a collection of `AbstractChar`).


nalimilan · 2018-03-01T23:32:37Z

  * Like C and Java, but unlike most dynamic languages, Julia has a first-class type representing
-    a single character, called `Char`. This is just a special kind of 32-bit primitive type whose numeric
-    value represents a Unicode code point.
+    a single character, called `AbstractChar`. This is just a special kind of 32-bit primitive type whose numeric value represents a Unicode code point.


Not necessarily 32-bit now. Maybe mention Char or say "by default"?

nalimilan · 2018-03-01T23:34:50Z

+"""
+Char
+
+struct InvalidCharError{T<:AbstractChar} <: Exception


Maybe InexactError could be used here? Just an idea.

Possibly, but I'm not sure if that change belongs in this PR.

Sorry, I hadn't realized this type already existed in master.

stevengj · 2018-03-01T23:46:35Z

I'm not sure what CodePoint accomplishes?

Keno · 2018-03-02T07:32:36Z

The idea of CodePoint is to make clear what the conversion is about. convert(UInt32, c) works fine technically, of course, but the question is: do you really want that? E.g. do you want to be able to push!(UInt32[1], 'a')? I'd argue that that operation doesn't really make much sense. One is an integer; the other is a codepoint. They don't really have the same operations (e.g. you can do arithmetic on integers, but not on codepoints). It just seems like a much larger semantic bridge than we usually ascribe to convert. A related but separate argument is that packages will want a UTF32-style string anyway, for which CodePoint is the natural character type, so it makes sense to pre-define it for operations that operate on codepoints.

stevengj · 2018-03-02T13:26:16Z

@Keno, we could define the constructors Char(int) and UInt32(char) but not convert to avoid accidental conversions. That seems like a sensible thing to me, is consistent with e.g. #16024, and would address your objection without introducing a new type.

(UTF32String in LegacyStrings might want to define a CodePointChar or UTF32Char or something, but I don't see a need for this to be in Base.)

Rather than defining a single new type, it makes more sense to me to define a new function codepoint(c::AbstractChar)::Integer that returns the codepoint of c in the most appropriate integer type for c (e.g. possibly UInt8 for an ASCII char type), and to have codepoint(::Type{<:AbstractChar}) return the corresponding integer type. reinterpret(codepoint(typeof(x)), x) would then return the raw encoded bytes.
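A short sketch of the value-level half of this proposal follows. `OneByteChar` is a hypothetical type. The type-level form `codepoint(::Type{<:AbstractChar})` is only the proposal made in this comment and is not used here.

```julia
# Hypothetical character type whose natural codepoint integer is UInt8.
struct OneByteChar <: AbstractChar
    byte::UInt8
    OneByteChar(b::UInt8) = new(b)
end
OneByteChar(u::UInt32) = OneByteChar(UInt8(u))
Base.codepoint(c::OneByteChar) = c.byte    # returns a UInt8

wide   = codepoint('∀')                # built-in Char: a UInt32 codepoint
narrow = codepoint(OneByteChar('a'))   # custom type: a UInt8 codepoint
```

Both results are Unicode codepoints. Only the integer type differs, which is what "the most appropriate integer type for c" refers to.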

StefanKarpinski

Very nice. I approve modulo comments.

StefanKarpinski · 2018-03-02T16:04:11Z

+convert(::Type{T}, x::AbstractChar) where {T<:Number} = T(x)

-rem(x::Char, ::Type{T}) where {T<:Number} = rem(UInt32(x), T)
+rem(x::AbstractChar, ::Type{T}) where {T<:Number} = rem(UInt32(x), T)


It seems a bit off that this supports any kind of Number rather than just Integer. Pre-existing issue, I know.

StefanKarpinski · 2018-03-02T16:06:38Z

-+(x::Integer, y::Char) = y + x
+# fallbacks:
+isless(x::AbstractChar, y::AbstractChar) = isless(Char(x), Char(y))
+==(x::AbstractChar, y::AbstractChar) = Char(x) == Char(y)


It seems better for the fallback comparisons to be done in UInt32.

I originally had it that way, but this way is more general. The issue is that all Unicode codepoints can be represented by Char, but not vice versa.

Converting to UInt32 would mean that ASCIIChar('a') == typemax(Char) would throw an InvalidCharError rather than returning false.

Ah, I see what you're saying. Perhaps this then:

isless(x::Char, y::AbstractChar) = isless(x, Char(y))
isless(x::AbstractChar, y::Char) = isless(Char(x), y)
isless(x::AbstractChar, y::AbstractChar) = isless(UInt32(x), UInt32(y))
==(x::Char, y::AbstractChar) = x == Char(y)
==(x::AbstractChar, y::Char) = Char(x) == y
==(x::AbstractChar, y::AbstractChar) = UInt32(x) == UInt32(y)
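The typemax(Char) case mentioned in this thread can be reproduced with a small sketch. `ASCIIChar` here is a hypothetical type defined only for the example, and the sketch assumes the fallback comparison converts both sides to `Char`.

```julia
# Hypothetical one-byte character type, for illustration only.
struct ASCIIChar <: AbstractChar
    byte::UInt8
    function ASCIIChar(b::UInt8)
        b < 0x80 || throw(ArgumentError("byte is outside the ASCII range"))
        return new(b)
    end
end
ASCIIChar(u::UInt32) =
    u < 0x80 ? ASCIIChar(u % UInt8) : throw(ArgumentError("codepoint is outside the ASCII range"))
Base.codepoint(c::ASCIIChar) = c.byte

invalid = typemax(Char)    # a Char value that is not a Unicode codepoint

# Comparing through Char never needs a codepoint, so this returns false.
eq = (ASCIIChar('a') == invalid)
lt = isless(ASCIIChar('a'), invalid)

# Comparing through UInt32 would have to convert the invalid Char first.
threw = try
    UInt32(invalid)
    false
catch err
    err isa Base.InvalidCharError
end
```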

StefanKarpinski · 2018-03-02T16:09:51Z

 const hex_chars = UInt8['0':'9';'a':'z']

-function show(io::IO, c::Char)
+function show(io::IO, c::AbstractChar)


I don't think this is generic because of the call to reinterpret(UInt32, c) – we have no idea what the representation of an abstract character type is – it may not even be 32 bits or a bits type at all. We could extract show_invalid(c::AbstractChar) as a method that characters have to implement.

As noted in another thread, we could use reinterpret(UInt32, Char(c)).

Yes, the reinterpret call is only for invalid chars. I agree that refactoring this seems like a good choice.
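The suggestion to go through `Char` before reinterpreting can be sketched as follows. `raw_bits` and `PlainChar` are hypothetical names. The bit patterns in the comments reflect how `Char` currently stores its data, which is an implementation detail and not a documented guarantee.

```julia
# Hypothetical helper: valid for any AbstractChar, because only the
# built-in Char is ever reinterpreted.
raw_bits(c::AbstractChar) = reinterpret(UInt32, Char(c))

# Hypothetical character type with a one-byte representation.
struct PlainChar <: AbstractChar
    byte::UInt8
    PlainChar(b::UInt8) = new(b)
end
PlainChar(u::UInt32) = PlainChar(UInt8(u))
Base.codepoint(c::PlainChar) = c.byte

bits_char    = raw_bits('a')              # UTF-8 byte 0x61 in the top byte
bits_custom  = raw_bits(PlainChar('a'))   # same result via Char(c)
bits_invalid = raw_bits("\xff"[1])        # works for invalid data too
```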

StefanKarpinski · 2018-03-02T16:18:08Z

 start(s::AbstractString) = 1
 done(s::AbstractString, i::Integer) = i > ncodeunits(s)
-eltype(::Type{<:AbstractString}) = Char
+eltype(::Type{<:AbstractString}) = Char # some string types may use another AbstractChar


Is there some way we can put an ::eltype(s) type assert somewhere to catch this if it's wrong?

Not sure…where would it go?

StefanKarpinski · 2018-03-02T16:24:20Z

+            "map(f, s::AbstractString) requires f to return AbstractChar; " *
            "try map(f, collect(s)) or a comprehension instead"))
-        write(out, c′::Char)
+        write(out, c′::AbstractChar)


100% agree: letting the value that's being printed determine the encoding would be a mess. We do need external support for encoded streams that handle this appropriately, but print(io, '∀'), where io is a UTF-16 encoded stream, should definitely not write UTF-8 data to io just because '∀' is a UTF-8 encoded character type. If you have a CharU16 value and a UTF-16 encoded stream, it should know that it can output code units directly, of course, but that's just a matter of providing the right print specializations.

StefanKarpinski · 2018-03-02T16:28:51Z

-function length(g::GraphemeIterator)
-    c0 = typemax(Char)
+function length(g::GraphemeIterator{S}) where {S}
+    c0 = eltype(S)(0x00000000)


Is '\0' a safe character for this? It's not a malformed character – does the Unicode spec ensure that it can't ever combine with anything?

Yes, \0 has the Grapheme_Cluster_Break property value Control, so there is always a break between it and the next character.

(I used \0 here since hopefully every conceivable AbstractChar type can encode this.)
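The behaviour relied on here can be observed from the outside with the `Unicode` standard library. This is only a sketch of the property, not of the `length` implementation in the diff.

```julia
using Unicode

# A base letter followed by a combining accent forms one grapheme.
combined = length(graphemes(string('e', '\u0301')))

# After '\0' (a control character) a combining accent does not attach,
# so the two characters count as two graphemes.
after_nul = length(graphemes(string('\0', '\u0301')))
```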

StefanKarpinski · 2018-03-02T16:30:44Z

-    value represents a Unicode code point.
+  * Like C and Java, but unlike most dynamic languages, Julia has a first-class type for representing
+    a single character, called `AbstractChar`. The built-in `Char` subtype of `AbstractChar`
+    is a 32-bit primitive type that can represent any Unicode character.


Perhaps mention that Char represents Unicode characters as zero-padded UTF-8 bytes rather than as code point values? This seems like as good a place as any to mention that.

I thought that we don't want to document the Char representation, lest people rely on it not changing?

StefanKarpinski · 2018-03-02T16:31:55Z

 ## [Characters](@id man-characters)

-A `Char` value represents a single character: it is just a 32-bit primitive type with a special literal
+An `AbstractChar` value represents a single character: it is just a 32-bit primitive type with a special literal


No longer accurate since an AbstractChar could have any representation at all. The simplest fix is just to leave this as Char.

stevengj · 2018-03-02T17:36:02Z

print and write have different semantics for other types, so why not for AbstractChar?

write(io, 0x00) and write(io, 0x0000) output the same (isequal) value in different "encodings", whereas print in both cases outputs 0. So maybe it should be the same for AbstractChar. (This is also how UTF16String was read and written by read/write in LegacyStrings, and it seemed pretty convenient).

print(io, x) is defined to output text, and in that case I agree that the encoding should be determined by io (defaulting to UTF-8) and not by x.

(That is, there would be a fallback print(io, x::AbstractChar) = print(io, Char(x)), but no fallback for write or read: implementations would be responsible for providing the latter, and they would be understood to use a type-specific encoding.)
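A minimal sketch of this split, under the assumption that `print` falls back to `Char` while `write` and `read` are supplied by the character type. `Latin1Char` is a hypothetical type that stores one ISO-8859-1 byte.

```julia
# Hypothetical Latin-1 character type.
struct Latin1Char <: AbstractChar
    byte::UInt8
    Latin1Char(b::UInt8) = new(b)
end
Latin1Char(u::UInt32) = Latin1Char(UInt8(u))    # InexactError above U+00FF
Base.codepoint(c::Latin1Char) = c.byte

# Type-specific binary encoding: one raw byte.
Base.write(io::IO, c::Latin1Char) = write(io, c.byte)
Base.read(io::IO, ::Type{Latin1Char}) = Latin1Char(read(io, UInt8))

e = Latin1Char('\u00e9')    # é

io = IOBuffer(); print(io, e)
printed = take!(io)         # text output: UTF-8 on the built-in streams

io = IOBuffer(); write(io, e)
written = take!(io)         # raw output: the type's own encoding

roundtrip = read(IOBuffer(written), Latin1Char)

# The integer analogy from the comment above: write depends on the type,
# print does not.
nbytes = (write(IOBuffer(), 0x00), write(IOBuffer(), 0x0000))
texts  = (sprint(print, 0x00), sprint(print, 0x0000))
```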

StefanKarpinski · 2018-03-02T17:39:41Z

re: print/write behavior, that seems reasonable to me 👍. @JeffBezanson may have thoughts.

stevengj · 2018-03-02T19:39:15Z

I updated the PR to make the print/write distinction discussed above.

I also did a pass over the source code and I found lots of write calls that should be print calls if we want the latter to (potentially) have an io-specific encoding. I fixed most of these, since there should be no performance difference. As an exception, there are various cases (e.g. in printing numbers) where we output raw ASCII bytes to the stream with write or unsafe_write. I left these as-is so as not to affect performance, but future revisions should think about how to make this more generic. Improving this in the future shouldn't require breaking changes, though, so it can wait until 1.1 or later.

stevengj · 2018-03-02T22:04:17Z

@Keno, I tried getting rid of the convert functions and ran into trouble right away: base/chars.jl uses const hex_chars = UInt8['0':'9';'a':'z'], which implicitly relies on convert(UInt8, ::Char). Alternatives like UInt8.(['0':'9';'a':'z']) seem much more annoying. So, I'd prefer to leave that change to another PR, if any.

But, as mentioned by @StefanKarpinski above, I think it should be safe to deprecate the implicit Number conversions, leaving only implicit Integer conversions for now.
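For reference, the implicit conversions under discussion look like this in practice. The sketch assumes `convert(T, ::Char)` for integer `T` is still defined, as it is after this PR. The table is cut down to hexadecimal digits for brevity.

```julia
# Typed concatenation converts each Char through convert(UInt8, ::Char).
hex_digits = UInt8['0':'9'; 'a':'f']

# The explicit alternative mentioned above, using broadcasting.
explicit = UInt8.(['0':'9'; 'a':'f'])

# The case Keno objects to: pushing a character into an integer array
# also goes through convert.
v = push!(UInt32[1], 'a')
```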

Keno · 2018-03-02T22:18:55Z

I'm actually much more comfortable with an implicit conversion to UInt8, as long as that conversion is only defined where convert(UInt8, c::Char) == reinterpret(UInt32, c::Char) % UInt8.

Keno · 2018-03-02T22:19:07Z

But fine to do in a different PR.

stevengj · 2018-03-02T22:39:23Z

Added the codepoint(c::Char) function as discussed above.

StefanKarpinski · 2018-03-02T22:47:18Z

-their own implementations of `write` and `read`.
+via `codepoint(char)` will not reveal this encoding because it always returns the
+Unicode value of the character. `print(io, c)` of any `c::AbstractChar`
+produces UTF-8 output by default (via conversion to `Char` if necessary).


Perhaps clarify that it's not necessary that print(io, c) produces UTF-8 but rather that the output encoding is determined by io – and the built-in IO types are all UTF-8.

stevengj · 2018-03-02T23:40:09Z

Note that to really support non-UTF8 IO streams, we'll eventually want to have some kind of trait(io) that gives us information about the encoding.

Not only do we want to know whether it is UTF-8 (so that e.g. the fast path for print(io, ::String) can be used), but we also want to know whether it is any superset of ASCII (so that e.g. the fast path for writing ASCII numeric digits can be used).

That will involve a separate design process, and should be non-breaking as far as I can tell, so it can happen outside of this PR.
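Purely as a sketch of what such a trait might look like, and not a design that exists in Base: every name below (`TextEncoding`, `text_encoding`, `is_ascii_superset`, `UTF16Stream`, `emit_digits`) is hypothetical.

```julia
# Hypothetical encoding tags.
abstract type TextEncoding end
struct UTF8Encoding    <: TextEncoding end
struct UTF16LEEncoding <: TextEncoding end

# Hypothetical wrapper that marks an underlying stream as UTF-16LE.
struct UTF16Stream{T<:IO}
    io::T
end

# Hypothetical trait: ordinary IO objects are assumed to be UTF-8.
text_encoding(::IO) = UTF8Encoding()
text_encoding(::UTF16Stream) = UTF16LEEncoding()

is_ascii_superset(::UTF8Encoding) = true
is_ascii_superset(::TextEncoding) = false

raw(io::IO) = io
raw(s::UTF16Stream) = s.io

# Write the decimal digits of n, choosing a path from the trait.
function emit_digits(dest, n::Integer)
    bytes = codeunits(string(n))               # ASCII digits
    if is_ascii_superset(text_encoding(dest))
        write(raw(dest), bytes)                # fast path: raw ASCII bytes
    else
        for b in bytes
            write(raw(dest), htol(UInt16(b)))  # one UTF-16LE code unit per digit
        end
    end
    return nothing
end

utf8_io = IOBuffer(); emit_digits(utf8_io, 42)
utf16_io = IOBuffer(); emit_digits(UTF16Stream(utf16_io), 42)
utf8_bytes = take!(utf8_io)
utf16_bytes = take!(utf16_io)
```

The fast path is selected by dispatch on the trait value, so callers such as number printing would not need to know the concrete stream type.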

stevengj · 2018-03-07T18:27:17Z

AppVeyor CI failure seems to be an unrelated timeout.

stevengj added the unicode label (Related to unicode characters and encodings) Mar 1, 2018

stevengj requested review from Keno and StefanKarpinski March 1, 2018 21:54

nalimilan reviewed Mar 1, 2018

stevengj and others added 3 commits March 2, 2018 08:50

add AbstractChar supertype of Char

63e04bf

NEWS link

07c665d

fixes from feedback

402f9ed

stevengj force-pushed the abstractchar branch from a53a602 to 402f9ed Compare March 2, 2018 14:05

StefanKarpinski approved these changes Mar 2, 2018

stevengj added 2 commits March 2, 2018 14:16

use print when io-specific text encoding (usually UTF-8) is desired, and write when encoding should be determined by the argument type

cc5e445

restore optimized print for strings, and analogous optimizations for SubString{String}

26c6ade

updates in response to documentation comments

fbfbcb3

StefanKarpinski requested a review from JeffBezanson March 2, 2018 20:47

add codepoint function and more conversions, and move some stuff from boot.jl to char.jl

954a5df

StefanKarpinski reviewed Mar 2, 2018

stevengj mentioned this pull request Mar 2, 2018

custom character literal macros? #26305

Closed

more tests, fixes

bd21bf9

StefanKarpinski merged commit b1b0149 into JuliaLang:master Mar 7, 2018

stevengj deleted the abstractchar branch March 7, 2018 22:17

This was referenced Apr 6, 2018

Change write() to print() JuliaMath/DecFP.jl#68

Merged

Change write() to print() JuliaIO/Formatting.jl#55

Merged

stevengj mentioned this pull request May 17, 2018

add bytes2hex(io, a) method #27124

Merged

kimikage mentioned this pull request Jan 16, 2020

[RFC] Commonize further code between Fixed and Normed JuliaMath/FixedPointNumbers.jl#168

Merged

Keno pushed a commit that referenced this pull request Jun 5, 2024

add AbstractChar supertype of Char (#26286)

619ee4b
