Skip to content

fix: binary string concat#20787

Open
theirix wants to merge 9 commits intoapache:mainfrom
theirix:binary-to-binaryconcat
Open

fix: binary string concat#20787
theirix wants to merge 9 commits intoapache:mainfrom
theirix:binary-to-binaryconcat

Conversation

@theirix
Copy link
Contributor

@theirix theirix commented Mar 7, 2026

Which issue does this PR close?

Rationale for this change

There is no support for concatenating binary strings.

There are two ways:

  1. Cast binaries to text and then concatenate text type (most common behaviour across DBs as discussed in the ticket)
  2. Concatenate binaries and provide binary type (only for Spark)

This PR goes with a way (1). We could customise the operator/UDF per-dialect, but it is not currently supported.

What changes are included in this PR?

  • Support binary concatenation via a pipe operator
  • Support binary concatenation in concat UDF

Are these changes tested?

Added SLT tests

Are there any user-facing changes?

No

@github-actions github-actions bot added logical-expr Logical plan and expressions physical-expr Changes to the physical-expr crates sqllogictest SQL Logic Tests (.slt) functions Changes to functions implementation labels Mar 7, 2026
@theirix theirix marked this pull request as ready for review March 7, 2026 22:51
Copy link
Contributor

@kosiew kosiew left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@theirix

Thanks for working on this.

.extend_from_slice(array.value(i).as_bytes());
}
}
ColumnarValueRef::NullableBinaryArray(array) => {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think there is a safety issue here. By accepting BinaryArray directly in the string builders, concat() can now pass unchecked bytes into StringArrayBuilder::finish, which still relies on build_unchecked() and assumes all data is valid UTF-8.

The scalar path and the || operator both validate binary inputs first, but this new array path skips that step. That means malformed binary values could cause a panic or produce an invalid string array instead of a proper execution error.

Would it make sense to validate or cast binary arrays before they reach these builders? Also, it would be great to add a regression test that covers invalid UTF-8 input so we lock in the expected behavior.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the review, @kosiew !
It's a good spot for safety. I refactored the custom builders to enforce checked behaviour when operations even involve binary arrays. Performing validation in the finish method allows concatenating binary strings, which could possibly be split in the middle of code points (see a special SLT case for this). So it is performant and robust.

Also added SLTs for all code paths.

"Concat function does not support scalar type {}",
scalar
)?,
}
}
}
let result = values.concat();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Small suggestion. The return type precedence (Utf8View > LargeUtf8 > Utf8) is computed both in return_type and again in invoke_with_args.

It might be worth pulling this into a small helper so we do not accidentally update one path and forget the other if new types get added later.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code was changed since then. However, I extracted a small helper - not as elegant as expected, but it works fine.

@@ -933,6 +933,18 @@ fn pre_selection_scatter(
}

fn concat_elements(left: &ArrayRef, right: &ArrayRef) -> Result<ArrayRef> {
if *left.data_type() == DataType::Binary && *right.data_type() == DataType::Binary {
// Cast Binary to Utf8 to validate UTF-8 encoding before concatenation
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could this binary-to-UTF8 normalization live in a shared helper with the concat UDF path?

Right now the operator has its own validation branch while the UDF has separate binary handling, which makes it a little harder to keep error semantics and future binary/string coercion changes aligned.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unfortunately, it cannot be done since a physical expression crate cannot depend on functions crate. UDFs and concat operator were always separated.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are right that datafusion/physical-expr should not depend on datafusion/functions, so a helper cannot simply live in the UDF crate and be reused from the operator.

There are still a way to "keep binary-to-UTF8 normalization semantics aligned between the UDF and || operator." without introducing an illegal dependency:

  • Move shared normalization logic into a lower-level crate both can depend on. The important part is that the shared code lives below both datafusion/functions and datafusion/physical-expr, not in either one.

@theirix theirix force-pushed the binary-to-binaryconcat branch from f4b081b to 469f790 Compare March 21, 2026 21:40
Copy link
Contributor

@kosiew kosiew left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@theirix
Nice catch fixing validation in the StringArrayBuilder and LargeStringArrayBuilder finish paths.

I think there is still one case left here though.

self.block.push_str(array.value(i));
}
}
ColumnarValueRef::NullableBinaryArray(array) => {
Copy link
Contributor

@kosiew kosiew Mar 23, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

concat still prefers Utf8View whenever any argument has that type. In this branch, binary inputs are written through StringViewArrayBuilder, and both binary array variants still call std::str::from_utf8(...).unwrap() on the bytes. That means a query like concat(arrow_cast('ok', 'Utf8View'), binary_col) can still panic on malformed UTF-8 instead of returning an execution error.

The new tests cover valid Utf8View + Binary inputs and invalid Binary + Binary inputs, but they do not cover the invalid mixed Utf8View + Binary case that still reaches this unchecked path.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch! I added tests and introduced the same approach to StringViewArrayBuilder, using unsafe only when operations don't taint the buffer. Also, it should be more performant since we do UTF-8 validation once.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

functions Changes to functions implementation logical-expr Logical plan and expressions physical-expr Changes to the physical-expr crates sqllogictest SQL Logic Tests (.slt)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Binary string (BYTEA, Binary) concatenation

2 participants