add AbstractChar supertype of Char#26286
Conversation
|
An exception to my comment above about injectively mapping other character sets into Unicode is the Han unification, in which different character variants in non-Unicode Japanese encodings are mapped to the same codepoint in Unicode (which viewed them as "font" variations rather than semantically distinct characters). If you want to define an |
|
One thing we had discussed was whether continuing to use |
|
|
|
SuRe ;) |
|
|
||
| # fallback: other AbstractChar types, by default, are assumed | ||
| # not to support malformed or overlong encodings. | ||
| ismalformed(c::AbstractChar) = false |
There was a problem hiding this comment.
Is it really a good idea to define these fallbacks? They can be wrong for some types, and they are easy to implement. If you implement your own AbstractChar type, it's good to think about this question precisely.
There was a problem hiding this comment.
All of the proposed AbstractChar types I've ever seen, other than Char, don't support malformed characters. So it makes sense to me to make this the default. You can always override it.
Another reason to make this the default is that if ismalformed(c) returns true, you need to define additional methods to decode the character.
| isless(x::AbstractChar, y::AbstractChar) = isless(Char(x), Char(y)) | ||
| ==(x::AbstractChar, y::AbstractChar) = Char(x) == Char(y) | ||
| hash(x::AbstractChar, h::UInt) = | ||
| hash_uint64(((UInt32(x) + UInt64(0xd060fad0)) << 32) ⊻ UInt64(h)) |
There was a problem hiding this comment.
Why not use reinterpret, so that invalid codepoints can be hashed (just like for Char)?
There was a problem hiding this comment.
AbstractChar need not be reinterpretable to UInt32.
There was a problem hiding this comment.
But you could convert to Char first...
There was a problem hiding this comment.
Sure, that's a good point I guess, since equality is defined in terms of comparison, to Char, hashing should match.
| * Only the index of the first code unit of a `Char` is a valid index | ||
| * The encoding of a `Char` is independent of what precedes or follows it | ||
| * Each `AbstractChar` in a string is encoded by one or more code units | ||
| * Only the index of the first code unit of a `AbstractChar` is a valid index |
| "map(f, s::AbstractString) requires f to return AbstractChar; " * | ||
| "try map(f, collect(s)) or a comprehension instead")) | ||
| write(out, c′::Char) | ||
| write(out, c′::AbstractChar) |
There was a problem hiding this comment.
Do we require all AbstractChar implementations to implement write so that it uses UTF-8? This code relies on this assumption.
There was a problem hiding this comment.
There is a fallback write method that works via conversion to Char.
There was a problem hiding this comment.
OK. But I guess it should be mentioned in the docstring for AbstractChar too? It could sound natural to somebody implementing e.g. a Latin1Char to write it in ISO-8859-1, which wouldn't be correct.
There was a problem hiding this comment.
Alternatively, we could define print to always output UTF-8, and write to output a raw encoded value.
There was a problem hiding this comment.
But I tend to think that the I/O encoding should be a property of the stream, not the string or character type, because usually all strings in a given stream should use the same encoding.
There was a problem hiding this comment.
100% agree: letting the value that's being printed determine the encoding would be a mess. We do need external support for encoded streams that handle this appropriately, but print(io, '∀') where io is a UTF-16 encoded stream it should definitely not write UTF-8 data to io because '∀' is a UTF-8 encode character type. If you have a CharU16 value and a UTF-16 encoded stream it should know that it can output code units directly, of course, but that's just a matter of providing the right print specializations.
| If `r` is a function, each occurrence is replaced with `r(s)` | ||
| where `s` is the matched substring (when `pat`is a `Regex` or `AbstractString`) or | ||
| character (when `pat` is a `Char` or a collection of `Char`). | ||
| character (when `pat` is a `AbstractChar` or a collection of `AbstractChar`). |
| * Like C and Java, but unlike most dynamic languages, Julia has a first-class type representing | ||
| a single character, called `Char`. This is just a special kind of 32-bit primitive type whose numeric | ||
| value represents a Unicode code point. | ||
| a single character, called `AbstractChar`. This is just a special kind of 32-bit primitive type whose numeric value represents a Unicode code point. |
There was a problem hiding this comment.
Not necessarily 32-bit now. Maybe mention Char or say "by default"?
| """ | ||
| Char | ||
|
|
||
| struct InvalidCharError{T<:AbstractChar} <: Exception |
There was a problem hiding this comment.
Maybe InexactError could be used here? Just an idea.
There was a problem hiding this comment.
Possibly, but I'm not sure if that change belongs in this PR.
There was a problem hiding this comment.
Sorry, I hadn't realized this type already existed in master.
|
I'm not sure what |
|
The idea of codepoint is to make clear what the conversion is about. |
|
@Keno, we could define the constructors ( Rather than defining a single new type, it makes more sense to me to define a new function |
StefanKarpinski
left a comment
There was a problem hiding this comment.
Very nice. I approve modulo comments.
| convert(::Type{T}, x::AbstractChar) where {T<:Number} = T(x) | ||
|
|
||
| rem(x::Char, ::Type{T}) where {T<:Number} = rem(UInt32(x), T) | ||
| rem(x::AbstractChar, ::Type{T}) where {T<:Number} = rem(UInt32(x), T) |
There was a problem hiding this comment.
It seems a bit off that this supports any kind of Number rather than just Integer. Pre-existing issue, I know.
| +(x::Integer, y::Char) = y + x | ||
| # fallbacks: | ||
| isless(x::AbstractChar, y::AbstractChar) = isless(Char(x), Char(y)) | ||
| ==(x::AbstractChar, y::AbstractChar) = Char(x) == Char(y) |
There was a problem hiding this comment.
It seems better for the fallback comparisons to be done in UInt32.
There was a problem hiding this comment.
I originally had it that way, but this way is more general. The issue is that all Unicode codepoints can be represented by Char, but not vice versa.
Converting to UInt32 would mean that ASCIIChar('a') == typemax(Char) would throw an InvalidCharError rather than returning false.
There was a problem hiding this comment.
Ah, I see what you're saying. Perhaps this then:
isless(x::Char, y::AbstractChar) = isless(x, Char(y))
isless(x::AbstractChar, y::Char) = isless(Char(x), y)
isless(x::AbstractChar, y::AbstractChar) = isless(UInt32(x), UInt32(y))
==(x::Char, y::AbstractChar) = x == Char(y)
==(x::AbstractChar, y::Char) = Char(x) == y
==(x::AbstractChar, y::AbstractChar) = UInt32(x) == UInt32(y)| const hex_chars = UInt8['0':'9';'a':'z'] | ||
|
|
||
| function show(io::IO, c::Char) | ||
| function show(io::IO, c::AbstractChar) |
There was a problem hiding this comment.
I don't think this is generic because of the call to reinterpret(UInt32, c) – we have no idea what the representation of an abstract character type is – it may not even be 32-bits or a bits type at all. We could extract show_invalid(c::AbstractChar) as a method that characters have to implement.
There was a problem hiding this comment.
As noted in another thread, we could use reinterpret(UInt32, Char(c)).
There was a problem hiding this comment.
Yes, the reinterpret call is only for invalid chars. I agree that refactoring this seems like a good choice.
| start(s::AbstractString) = 1 | ||
| done(s::AbstractString, i::Integer) = i > ncodeunits(s) | ||
| eltype(::Type{<:AbstractString}) = Char | ||
| eltype(::Type{<:AbstractString}) = Char # some string types may use another AbstractChar |
There was a problem hiding this comment.
Is there some way we can put an ::eltype(s) type assert somewhere to catch this if it's wrong?
There was a problem hiding this comment.
Not sure…where would it go?
| "map(f, s::AbstractString) requires f to return AbstractChar; " * | ||
| "try map(f, collect(s)) or a comprehension instead")) | ||
| write(out, c′::Char) | ||
| write(out, c′::AbstractChar) |
There was a problem hiding this comment.
100% agree: letting the value that's being printed determine the encoding would be a mess. We do need external support for encoded streams that handle this appropriately, but print(io, '∀') where io is a UTF-16 encoded stream it should definitely not write UTF-8 data to io because '∀' is a UTF-8 encode character type. If you have a CharU16 value and a UTF-16 encoded stream it should know that it can output code units directly, of course, but that's just a matter of providing the right print specializations.
| function length(g::GraphemeIterator) | ||
| c0 = typemax(Char) | ||
| function length(g::GraphemeIterator{S}) where {S} | ||
| c0 = eltype(S)(0x00000000) |
There was a problem hiding this comment.
Is '\0' a safe character for this? It's not a malformed character – does the Unicode spec ensure that it can't ever combine with anything?
There was a problem hiding this comment.
Yes, \0 has the Grapheme_Cluster_Break property.
There was a problem hiding this comment.
(I used \0 here since hopefully every conceivable AbstractChar type can encode this.)
| value represents a Unicode code point. | ||
| * Like C and Java, but unlike most dynamic languages, Julia has a first-class type for representing | ||
| a single character, called `AbstractChar`. The built-in `Char` subtype of `AbstractChar` | ||
| is a 32-bit primitive type that can represent any Unicode character. |
There was a problem hiding this comment.
Perhaps mention that Char represents Unicode characters as zero-padded UTF-8 bytes rather than as code point values? This seems like as good a place as any to mention that.
There was a problem hiding this comment.
I thought that we don't want to document the Char representation, lest people rely on it not changing?
| ## [Characters](@id man-characters) | ||
|
|
||
| A `Char` value represents a single character: it is just a 32-bit primitive type with a special literal | ||
| An `AbstractChar` value represents a single character: it is just a 32-bit primitive type with a special literal |
There was a problem hiding this comment.
No longer accurate since an AbstractChar could have any representation at all. The simplest fix is just to leave this as Char.
|
(That is, there would be a fallback |
|
re: |
…and write when encoding should be determined by the argument type
…SubString{String}
|
I updated the PR to make the I also did a pass over the source code and I found lots of |
|
@Keno, I tried getting rid of the But, as mentioned by @StefanKarpinski above, I think it should be safe to deprecate the implicit |
|
I'm actually much more comfortable with an implicit conversion to |
|
But fine to do in a different PR. |
… boot.jl to char.jl
|
Added the |
| their own implementations of `write` and `read`. | ||
| via `codepoint(char)` will not reveal this encoding because it always returns the | ||
| Unicode value of the character. `print(io, c)` of any `c::AbstractChar` | ||
| produces UTF-8 output by default (via conversion to `Char` if necessary). |
There was a problem hiding this comment.
Perhaps clarify that it's not necessary that print(io, c) produces UTF-8 but rather that the output encoding is determined by io – and the built-in IO types are all UTF-8.
|
Note that to really support non-UTF8 Not only do we want to know whether it is UTF-8 (so that e.g. the fast path for That will involve a separate design process, and should be non-breaking as far as I can tell, so it can happen outside of this PR. |
|
AppVeyor CI failure seems to be an unrelated timeout. |
Fixes #25302. Wherever practical, functions that took
::Chararguments are changed to accept any::AbstractChar, with fallbacks defined as necessary.For
T<:AbstractChar,UInt32(c::T)is defined to return the Unicode codepoint represented byc(or throw an error ifclies outside of Unicode), andT(x::UInt32)should create aTfrom the Unicode codepointx(or throw an error …Tmay represent only a subset of Unicode). This makes it possible to define generic fallbacks for comparison toChar, output, etcetera. Even ancient character sets like EBCDIC have a well-defined injective mapping into Unicode, so I don't think we sacrifice any useful generality this way.