Conversation
| .extend_from_slice(array.value(i).as_bytes()); | ||
| } | ||
| } | ||
| ColumnarValueRef::NullableBinaryArray(array) => { |
There was a problem hiding this comment.
I think there is a safety issue here. By accepting BinaryArray directly in the string builders, concat() can now pass unchecked bytes into StringArrayBuilder::finish, which still relies on build_unchecked() and assumes all data is valid UTF-8.
The scalar path and the || operator both validate binary inputs first, but this new array path skips that step. That means malformed binary values could cause a panic or produce an invalid string array instead of a proper execution error.
Would it make sense to validate or cast binary arrays before they reach these builders? Also, it would be great to add a regression test that covers invalid UTF-8 input so we lock in the expected behavior.
There was a problem hiding this comment.
Thank you for the review, @kosiew !
It's a good spot for safety. I refactored the custom builders to enforce checked behaviour when operations even involve binary arrays. Performing validation in the finish method allows concatenating binary strings, which could possibly be split in the middle of code points (see a special SLT case for this). So it is performant and robust.
Also added SLTs for all code paths.
| "Concat function does not support scalar type {}", | ||
| scalar | ||
| )?, | ||
| } | ||
| } | ||
| } | ||
| let result = values.concat(); |
There was a problem hiding this comment.
Small suggestion. The return type precedence (Utf8View > LargeUtf8 > Utf8) is computed both in return_type and again in invoke_with_args.
It might be worth pulling this into a small helper so we do not accidentally update one path and forget the other if new types get added later.
There was a problem hiding this comment.
Code was changed since then. However, I extracted a small helper - not as elegant as expected, but it works fine.
| @@ -933,6 +933,18 @@ fn pre_selection_scatter( | |||
| } | |||
|
|
|||
| fn concat_elements(left: &ArrayRef, right: &ArrayRef) -> Result<ArrayRef> { | |||
| if *left.data_type() == DataType::Binary && *right.data_type() == DataType::Binary { | |||
| // Cast Binary to Utf8 to validate UTF-8 encoding before concatenation | |||
There was a problem hiding this comment.
Could this binary-to-UTF8 normalization live in a shared helper with the concat UDF path?
Right now the operator has its own validation branch while the UDF has separate binary handling, which makes it a little harder to keep error semantics and future binary/string coercion changes aligned.
There was a problem hiding this comment.
Unfortunately, it cannot be done since a physical expression crate cannot depend on functions crate. UDFs and concat operator were always separated.
There was a problem hiding this comment.
You are right that datafusion/physical-expr should not depend on datafusion/functions, so a helper cannot simply live in the UDF crate and be reused from the operator.
There are still a way to "keep binary-to-UTF8 normalization semantics aligned between the UDF and || operator." without introducing an illegal dependency:
- Move shared normalization logic into a lower-level crate both can depend on. The important part is that the shared code lives below both datafusion/functions and datafusion/physical-expr, not in either one.
# Conflicts: # datafusion/functions/src/string/concat_ws.rs
f4b081b to
469f790
Compare
| self.block.push_str(array.value(i)); | ||
| } | ||
| } | ||
| ColumnarValueRef::NullableBinaryArray(array) => { |
There was a problem hiding this comment.
concat still prefers Utf8View whenever any argument has that type. In this branch, binary inputs are written through StringViewArrayBuilder, and both binary array variants still call std::str::from_utf8(...).unwrap() on the bytes. That means a query like concat(arrow_cast('ok', 'Utf8View'), binary_col) can still panic on malformed UTF-8 instead of returning an execution error.
The new tests cover valid Utf8View + Binary inputs and invalid Binary + Binary inputs, but they do not cover the invalid mixed Utf8View + Binary case that still reaches this unchecked path.
There was a problem hiding this comment.
Good catch! I added tests and introduced the same approach to StringViewArrayBuilder, using unsafe only when operations don't taint the buffer. Also, it should be more performant since we do UTF-8 validation once.
Which issue does this PR close?
BYTEA,Binary) concatenation #12709.Rationale for this change
There is no support for concatenating binary strings.
There are two ways:
This PR goes with a way (1). We could customise the operator/UDF per-dialect, but it is not currently supported.
What changes are included in this PR?
concatUDFAre these changes tested?
Added SLT tests
Are there any user-facing changes?
No