fix(spark): enable aliased expressions to round-trip#348
Conversation
| case other => (other.output, true) | ||
| } | ||
| val names = if (project.getHint.isPresent) { | ||
| project.getHint.get().getOutputNames.asScala |
There was a problem hiding this comment.
is the hint.outputNames supposed to be just the column names, or also the possible inner struct field names in DFS form (like the other Substrait "names" fields)?
There was a problem hiding this comment.
It's just the column names, used to distinguish different aliases to the same underlying expression. Otherwise the Spark optimiser de-duplicates them causing the round-trip equality check to fail (even though they are equivalent).
There was a problem hiding this comment.
The docstring there says
// Assigns alternative output field names for any relation. Equivalent to the names field
// in RelRoot but applies to the output of the relation this RelCommon is attached to.
There's multiple ways to infer that, but "equivalent to the names field in RelRoot" would indicate it should be the DFS listing, including the inner names, since that's what the "names" field in RelRoot is?
(Using it as the DFS would help with the named_struct issue too ;))
There was a problem hiding this comment.
Hi @Blizzara, I'm not really sure what you want me to do here. Do you have a test case in mind that demonstrates the problem? Thanks :)
There was a problem hiding this comment.
We should have the "names" hint contain all of the names, including the inner names in the schema, so that it matches the "names" field on the RelRoot. You can see an example of how I'm planning on doing it for RelRoot here: #342 (I should merge that but it fails for the the test added in #346 as it needs a fix from #315 - it's actually a good catch from that test.)
b2700b6 to
0577d4a
Compare
| relation.Project.builder | ||
| .remap(remap) | ||
| .expressions(expressions.asJava) | ||
| .hint(Hint.builder.addAllOutputNames(ToSubstraitType.toNamedStruct(p.schema).names()).build()) |
There was a problem hiding this comment.
let's update visitExpand below to also use toNamedStruct().names() ?
| project.getExpressions.asScala | ||
| .map(expr => expr.accept(expressionConverter)) | ||
| .map(toNamedExpression) | ||
| val projectList = if (names.size == projectExprs.size) { |
There was a problem hiding this comment.
ah so this just skips using the names if there are inner structs? I guess that works, for a bit more complete solution I think you could use ToSparkType.toStructType to construct the expected schema (like here) and then pick the column names from there, and for even more complete do that + reuse the renameAndCastExprs.
A number of the TPC-DS tests were failing because the query contains multiple aliases to the same expression, causing a potential mismatch in the reference index. Although the plans were equivalent, the substrait POJO comparison failed. This commit uses the `hint` field of the Rel message to store the alias names, and restore them back to the Spark plan to match the original.
|
Hi @Blizzara, I notice you've approved but not merged. Is there anything more you need me to do here? |
A number of the TPC-DS tests were failing because the query contains multiple aliases to the same expression, causing a potential mismatch in the reference index. Although the plans were equivalent, the substrait POJO comparison failed. This commit uses the
hintfield of the Rel message to store the alias names, and restore them back to the Spark plan to match the original.