Skip to content

fix(spark): enable aliased expressions to round-trip#348

Merged
EpsilonPrime merged 1 commit into
substrait-io:mainfrom
andrew-coleman:alias
Apr 4, 2025
Merged

fix(spark): enable aliased expressions to round-trip#348
EpsilonPrime merged 1 commit into
substrait-io:mainfrom
andrew-coleman:alias

Conversation

@andrew-coleman

Copy link
Copy Markdown
Member

A number of the TPC-DS tests were failing because the query contains multiple aliases to the same expression, causing a potential mismatch in the reference index. Although the plans were equivalent, the substrait POJO comparison failed. This commit uses the hint field of the Rel message to store the alias names, and restore them back to the Spark plan to match the original.

case other => (other.output, true)
}
val names = if (project.getHint.isPresent) {
project.getHint.get().getOutputNames.asScala

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is the hint.outputNames supposed to be just the column names, or also the possible inner struct field names in DFS form (like the other Substrait "names" fields)?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's just the column names, used to distinguish different aliases to the same underlying expression. Otherwise the Spark optimiser de-duplicates them causing the round-trip equality check to fail (even though they are equivalent).

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The docstring there says

// Assigns alternative output field names for any relation. Equivalent to the names field
// in RelRoot but applies to the output of the relation this RelCommon is attached to.

There's multiple ways to infer that, but "equivalent to the names field in RelRoot" would indicate it should be the DFS listing, including the inner names, since that's what the "names" field in RelRoot is?

(Using it as the DFS would help with the named_struct issue too ;))

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @Blizzara, I'm not really sure what you want me to do here. Do you have a test case in mind that demonstrates the problem? Thanks :)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should have the "names" hint contain all of the names, including the inner names in the schema, so that it matches the "names" field on the RelRoot. You can see an example of how I'm planning on doing it for RelRoot here: #342 (I should merge that but it fails for the the test added in #346 as it needs a fix from #315 - it's actually a good catch from that test.)

relation.Project.builder
.remap(remap)
.expressions(expressions.asJava)
.hint(Hint.builder.addAllOutputNames(ToSubstraitType.toNamedStruct(p.schema).names()).build())

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

let's update visitExpand below to also use toNamedStruct().names() ?

project.getExpressions.asScala
.map(expr => expr.accept(expressionConverter))
.map(toNamedExpression)
val projectList = if (names.size == projectExprs.size) {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ah so this just skips using the names if there are inner structs? I guess that works, for a bit more complete solution I think you could use ToSparkType.toStructType to construct the expected schema (like here) and then pick the column names from there, and for even more complete do that + reuse the renameAndCastExprs.

A number of the TPC-DS tests were failing because the query
contains multiple aliases to the same expression, causing a
potential mismatch in the reference index.  Although the plans
were equivalent, the substrait POJO comparison failed.
This commit uses the `hint` field of the Rel message to store the
alias names, and restore them back to the Spark plan to match
the original.

@Blizzara Blizzara left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

@andrew-coleman

Copy link
Copy Markdown
Member Author

Hi @Blizzara, I notice you've approved but not merged. Is there anything more you need me to do here?
I've got more in the pipeline, but it needs this to go in first :). Many thanks!

@EpsilonPrime EpsilonPrime merged commit 791f7ce into substrait-io:main Apr 4, 2025
@andrew-coleman andrew-coleman deleted the alias branch April 4, 2025 10:54
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants