feat: Support Utf8View in JSON reader#7263
Conversation
alamb
left a comment
There was a problem hiding this comment.
Thank you very much @zhuqi-lucas!
I think this PR just needs
-
some adjustment on data allocation
-
A performance benchmark. Perhaps you could extend this one:
| assert_eq!(col1.null_count(), 2); | ||
| assert_eq!(col1.value(0), "1"); | ||
| assert_eq!(col1.value(1), "hello"); | ||
| assert_eq!(col1.value(2), "\nfoobar😀asfgÿ"); |
There was a problem hiding this comment.
this value has more than 12 bytes so it will exercise the longer string view path
| TapeElement::String(idx) => { | ||
| data_capacity += tape.get_string(idx).len(); | ||
| } | ||
| TapeElement::Null => { /* 不增加容量 */ } |
There was a problem hiding this comment.
would it be possible to use english in the comments?
| fn decode(&mut self, tape: &Tape<'_>, pos: &[u32]) -> Result<ArrayData, ArrowError> { | ||
| let coerce = self.coerce_primitive; | ||
| let mut data_capacity = 0; | ||
| for &p in pos { |
There was a problem hiding this comment.
StringView is different that StringArray in that only "long" strings (longer than 12 bytes) contributed to the data
Thus I think these calculations should be adjusted:
- only increase capacity for strings if the data is over 12 bytes
- don't increase for boolean
- For I32 probably we can use zero as well (as the longest such value is
-2147483647whcih is less than 12 bytes) - For I64 maybe we could be more sophisticated and only add data length if the value is over
999999999999etc. - For F32 and F64 I am not sure what hte maximum length of a string representation is so we should probably keep the existing estimate
More details on the layout are here: https://docs.rs/arrow/latest/arrow/array/struct.GenericByteViewArray.html
There was a problem hiding this comment.
Good explain and guide, thank you @alamb!
| _ => unreachable!(), | ||
| }, | ||
| TapeElement::I32(n) if coerce => { | ||
| builder.append_value(n.to_string()); |
There was a problem hiding this comment.
This allocation is quite unfortunate (to_string() allocates a string) but I see this is consistent with the other JSON string implementation:
|
(BTW the MSRV test is fixed on main thanks to @tustvold -- you should be able to merge up from main and the CI will pass) |
Thank you @alamb for review, addressed the comments now. And also the benchmark result for utf8view seems better: small_bench_primitive time: [6.2293 µs 6.2540 µs 6.2810 µs]
change: [+0.6034% +1.1278% +1.6922%] (p = 0.00 < 0.05)
Change within noise threshold.
Found 11 outliers among 100 measurements (11.00%)
1 (1.00%) low mild
7 (7.00%) high mild
3 (3.00%) high severesmall_bench_primitive_with_utf8view
time: [6.0220 µs 6.0420 µs 6.0649 µs]
Found 3 outliers among 100 measurements (3.00%)
1 (1.00%) low mild
2 (2.00%) high mild |
alamb
left a comment
There was a problem hiding this comment.
Thank you very much @zhuqi-lucas -- I think this is a really nice contribution.
I left some small code comment suggestions but I can also add them as a follow on PR as well
Thanks again!
| let coerce = self.coerce_primitive; | ||
| let mut data_capacity = 0; | ||
| for &p in pos { | ||
| match tape.get(p) { |
There was a problem hiding this comment.
I think a little context on the rationale here would be helpful:
| match tape.get(p) { | |
| // note that StringView is different that StringArray in that only | |
| // "long" strings (longer than 12 bytes) are stored in the buffer. | |
| // "short" strings are inlined into a fixed length structure. | |
| match tape.get(p) { |
| data_capacity += s.len(); | ||
| } | ||
| } | ||
| // For I64, only add capacity if the absolute value is greater than 999,999,999,999 |
There was a problem hiding this comment.
| // For I64, only add capacity if the absolute value is greater than 999,999,999,999 | |
| // For I64, only add capacity if the absolute value is greater than 999,999,999,999 | |
| // (the largest number that can fit in 12 bytes) |
| TapeElement::Null => { | ||
| // Do not increase capacity for null values | ||
| } | ||
| // For booleans, do not increase capacity |
There was a problem hiding this comment.
| // For booleans, do not increase capacity | |
| // For booleans, do not increase capacity (both "true" and "false" are less than | |
| // 12 bytes) |
| TapeElement::I32(low) => { | ||
| let val = ((high as i64) << 32) | (low as u32) as i64; | ||
| tmp_buf.clear(); | ||
| // Reuse the temporary buffer instead of allocating a new String |
Thank you @alamb for review! I also addressed the suggestions in latest PR, thanks! |
|
Thanks again @zhuqi-lucas |
Which issue does this PR close?
Closes #7244
Rationale for this change
Support Utf8View in JSON reader
What changes are included in this PR?
Support Utf8View in JSON reader
Are there any user-facing changes?
Support Utf8View in JSON reader