Use optimized StringBuilders / BulkNullStringArrayBuilder in more places #22148

@alamb

Now that we have highly optimized string builder APIs that generalize across the three string types (Utf8, LargeUtf8, Utf8View), we can reuse them in more kernels.

At the moment the code all lives in the datafusion-functions crate: https://github.com/apache/datafusion/blob/7708aa2dc61271423a5c334bd2e2025b5e275133/datafusion/functions/src/strings.rs

However, that means the builders can't be used from other crates. I suggest moving the string code to https://github.com/apache/datafusion/blob/0dfcd97a37e083e48aefc5267539ac453cc07b44/datafusion/physical-expr-common

This is consistent with things like String/BinaryMap:
https://github.com/apache/datafusion/blob/0dfcd97a37e083e48aefc5267539ac453cc07b44/datafusion/physical-expr-common/src/binary_map.rs#L40-L39

This might make the builders easier to find and reuse across crates.

As @neilconway says:

Other places where these APIs should be useful:

  • initcap
  • lower, upper: at least for the Unicode code path; for ASCII, we might not beat the hand-optimized code added in "perf: Optimize lower, upper for ASCII inputs" (#21980)
  • translate
  • reverse (might need a slightly different API)
  • to_char (might need a small API extension)
  • lpad, rpad (needs a closer look)

If we make the builders accessible outside the current crate, some of the Spark functions could use these APIs, as well as || for Utf8View values.
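To illustrate why a shared builder abstraction helps kernels like the ones listed above, here is a minimal, std-only sketch. Everything in it is hypothetical: `StringSink`, `OffsetSink`, `ViewSink`, and `reverse_into` are illustrative names, not the actual DataFusion or arrow-rs APIs, and the toy sinks only loosely model the real array layouts. The point is that a kernel written once against a trait works for every string representation that implements it.

```rust
// Hypothetical sketch: none of these names are DataFusion or arrow-rs APIs.

/// Minimal stand-in for a string array builder.
trait StringSink {
    /// Append one non-null value.
    fn append_value(&mut self, value: &str);
    /// Append a null slot.
    fn append_null(&mut self);
}

/// Toy builder with an offsets-plus-data layout (loosely like Utf8 / LargeUtf8).
#[derive(Default)]
struct OffsetSink {
    data: String,
    offsets: Vec<usize>, // end byte offset of each slot
    valid: Vec<bool>,    // false marks a null slot
}

impl StringSink for OffsetSink {
    fn append_value(&mut self, value: &str) {
        self.data.push_str(value);
        self.offsets.push(self.data.len());
        self.valid.push(true);
    }

    fn append_null(&mut self) {
        self.offsets.push(self.data.len());
        self.valid.push(false);
    }
}

impl OffsetSink {
    /// Read back slot `i`, returning `None` for nulls.
    fn get(&self, i: usize) -> Option<&str> {
        if !self.valid[i] {
            return None;
        }
        let start = if i == 0 { 0 } else { self.offsets[i - 1] };
        Some(&self.data[start..self.offsets[i]])
    }
}

/// Toy builder with one entry per value (very loosely like Utf8View).
#[derive(Default)]
struct ViewSink {
    values: Vec<Option<String>>,
}

impl StringSink for ViewSink {
    fn append_value(&mut self, value: &str) {
        self.values.push(Some(value.to_string()));
    }

    fn append_null(&mut self) {
        self.values.push(None);
    }
}

/// A kernel written once against the trait. It reverses each value by
/// Unicode scalar value (not by grapheme cluster) and reuses one scratch
/// buffer across rows instead of allocating per row.
fn reverse_into<'a, S: StringSink>(
    input: impl IntoIterator<Item = Option<&'a str>>,
    out: &mut S,
) {
    let mut scratch = String::new();
    for item in input {
        match item {
            Some(s) => {
                scratch.clear();
                scratch.extend(s.chars().rev());
                out.append_value(&scratch);
            }
            None => out.append_null(),
        }
    }
}
```

Calling `reverse_into` with an `OffsetSink` or a `ViewSink` runs the same kernel code against either layout, which is the kind of reuse the list above is after. The real builders would presumably expose a richer interface (for example, writing pieces of a value directly into the output buffer), so treat this only as a shape for the idea.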

Originally posted by @neilconway in #22029 (comment)
