fix(spark): enable aliased expressions to round-trip by andrew-coleman · Pull Request #348 · substrait-io/substrait-java

andrew-coleman · 2025-03-20T16:26:05Z

Several TPC-DS tests were failing because their queries contain multiple aliases to the same expression, which can cause a mismatch in the reference index. Although the plans were equivalent, the Substrait POJO comparison failed. This commit uses the `hint` field of the Rel message to store the alias names and restores them in the Spark plan so that it matches the original.
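The idea can be sketched in plain Scala without Spark or Substrait on the classpath. Everything below (`Alias`, `ProjectRel`, `toRel`, `fromRel`, the generated `col0`-style names) is a hypothetical stand-in for illustration, not the substrait-java API: the relation keeps only the expressions, the optional name list plays the role of the hint, and the reverse conversion zips the two back together.

```scala
object AliasRoundTrip {
  // Hypothetical, simplified model; not the real Spark or substrait-java classes.
  final case class Alias(expr: String, name: String)

  // The relation stores bare expressions; names travel separately, as a hint would.
  final case class ProjectRel(expressions: Seq[String], outputNames: Option[Seq[String]])

  // Forward conversion: record every alias name next to the expressions.
  def toRel(projections: Seq[Alias]): ProjectRel =
    ProjectRel(projections.map(_.expr), Some(projections.map(_.name)))

  // Reverse conversion: reuse the recorded names when present,
  // otherwise fall back to generated names (an assumption made for this sketch).
  def fromRel(rel: ProjectRel): Seq[Alias] = rel.outputNames match {
    case Some(names) =>
      rel.expressions.zip(names).map { case (e, n) => Alias(e, n) }
    case None =>
      rel.expressions.zipWithIndex.map { case (e, i) => Alias(e, s"col$i") }
  }
}
```

With two aliases of the same expression, such as `a + b AS x` and `a + b AS y`, the round trip through `toRel` and `fromRel` returns the original list only when the names are carried along; dropping them yields a plan that is equivalent but no longer compares equal.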

Blizzara · 2025-03-20T18:00:36Z

      case other => (other.output, true)
    }
+    val names = if (project.getHint.isPresent) {
+      project.getHint.get().getOutputNames.asScala


Is `hint.outputNames` supposed to be just the column names, or should it also include any inner struct field names in DFS order (like the other Substrait "names" fields)?

andrew-coleman · 2025-03-21

It's just the column names, used to distinguish different aliases to the same underlying expression. Otherwise the Spark optimiser deduplicates them, causing the round-trip equality check to fail (even though the plans are equivalent).

Blizzara · 2025-03-21

The docstring there says:

// Assigns alternative output field names for any relation. Equivalent to the names field
// in RelRoot but applies to the output of the relation this RelCommon is attached to.

There are multiple ways to read that, but "equivalent to the names field in RelRoot" suggests it should be the DFS listing, including the inner names, since that is what the "names" field in RelRoot contains.

(Using it as the DFS would help with the named_struct issue too ;))

andrew-coleman · 2025-03-24

Hi @Blizzara, I'm not really sure what you want me to do here. Do you have a test case in mind that demonstrates the problem? Thanks :)

Blizzara · 2025-03-26

We should have the "names" hint contain all of the names, including the inner names in the schema, so that it matches the "names" field on the RelRoot. You can see an example of how I'm planning on doing it for RelRoot here: #342. (I should merge that, but it fails on the test added in #346 because it needs a fix from #315. That test is actually a good catch.)
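To make the difference between "just the column names" and the DFS listing concrete, here is a minimal sketch in plain Scala. The `Field`, `Scalar` and `Struct` types and the `dfsNames` helper are hypothetical and only model a nested schema; they are not the Substrait or Spark types.

```scala
object DfsNames {
  // Hypothetical schema model, for illustration only.
  sealed trait FieldType
  case object Scalar extends FieldType
  final case class Struct(fields: Seq[Field]) extends FieldType
  final case class Field(name: String, tpe: FieldType)

  // Depth-first listing: each field's own name, followed by the names nested inside it.
  def dfsNames(fields: Seq[Field]): Seq[String] =
    fields.flatMap { f =>
      f.tpe match {
        case Scalar        => Seq(f.name)
        case Struct(inner) => f.name +: dfsNames(inner)
      }
    }

  // Top-level listing: the column names only.
  def columnNames(fields: Seq[Field]): Seq[String] = fields.map(_.name)

  // Example schema: id, address { city, zip }, name
  val schema: Seq[Field] = Seq(
    Field("id", Scalar),
    Field("address", Struct(Seq(Field("city", Scalar), Field("zip", Scalar)))),
    Field("name", Scalar)
  )
}
```

For the example schema, `columnNames` gives `id, address, name`, while `dfsNames` gives `id, address, city, zip, name`. The two lists only coincide when no column is a struct.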

Blizzara · 2025-04-01T20:09:04Z

    relation.Project.builder
      .remap(remap)
      .expressions(expressions.asJava)
+      .hint(Hint.builder.addAllOutputNames(ToSubstraitType.toNamedStruct(p.schema).names()).build())


Let's update visitExpand below to also use toNamedStruct().names()?

Blizzara · 2025-04-01T20:18:20Z

        project.getExpressions.asScala
          .map(expr => expr.accept(expressionConverter))
-          .map(toNamedExpression)
+      val projectList = if (names.size == projectExprs.size) {


Ah, so this just skips using the names if there are inner structs? I guess that works. For a more complete solution, I think you could use ToSparkType.toStructType to construct the expected schema (like here) and then pick the column names from there; for an even more complete one, do that and also reuse renameAndCastExprs.
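One way to read the "pick the column names from the schema" suggestion is sketched below in plain Scala. It is not the code from this PR: `ColType`, `width` and `topLevelNames` are hypothetical names, and the real implementation would work on Spark and Substrait types. Each column consumes one name for itself plus one per nested field, so the top-level names sit at the running offsets of the DFS list.

```scala
object TopLevelNames {
  // Hypothetical column shapes; field names are not stored here because
  // they come from the DFS name list.
  sealed trait ColType
  case object Scalar extends ColType
  final case class Struct(fields: Seq[ColType]) extends ColType

  // Number of DFS names a column consumes: itself plus everything nested in it.
  def width(t: ColType): Int = t match {
    case Scalar         => 1
    case Struct(fields) => 1 + fields.map(width).sum
  }

  // Returns the top-level column names, or None when the name list
  // does not match the shape of the schema.
  def topLevelNames(dfsNames: Seq[String], columns: Seq[ColType]): Option[Seq[String]] = {
    // Running offsets: where each column starts, with the total width last.
    val offsets = columns.scanLeft(0)((offset, col) => offset + width(col))
    if (offsets.last != dfsNames.size) None
    else Some(offsets.init.map(i => dfsNames(i)))
  }
}
```

Compared with the simpler `names.size == projectExprs.size` check, this variant still recovers the column names when a projected column is a struct, and it returns `None` rather than guessing when the counts disagree.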

Blizzara

Thanks!

andrew-coleman · 2025-04-04T10:20:50Z

Hi @Blizzara, I notice you've approved but not merged. Is there anything more you need me to do here?
I've got more in the pipeline, but it needs this to go in first :). Many thanks!

Blizzara reviewed Mar 20, 2025

andrew-coleman force-pushed the alias branch 2 times, most recently from b2700b6 to 0577d4a Compare March 26, 2025 15:17

andrew-coleman requested a review from Blizzara March 26, 2025 16:24

andrew-coleman force-pushed the alias branch from 0577d4a to 08bb6df Compare March 31, 2025 11:04

Blizzara reviewed Apr 1, 2025

andrew-coleman force-pushed the alias branch from 08bb6df to aa04686 Compare April 2, 2025 09:01

andrew-coleman requested a review from Blizzara April 2, 2025 10:44

Blizzara approved these changes Apr 3, 2025

EpsilonPrime approved these changes Apr 4, 2025

EpsilonPrime merged commit 791f7ce into substrait-io:main Apr 4, 2025

andrew-coleman deleted the alias branch April 4, 2025 10:54
