fix: skip empty metadata in intersect_metadata_for_union to prevent s…#21127
Open
RafaelHerrero wants to merge 4 commits intoapache:mainfrom
Open
fix: skip empty metadata in intersect_metadata_for_union to prevent s…#21127RafaelHerrero wants to merge 4 commits intoapache:mainfrom
RafaelHerrero wants to merge 4 commits intoapache:mainfrom
Conversation
…chema mismatch When a UNION ALL combines columns from sources with field metadata (e.g. Parquet) and NULL literals (which have no metadata), the intersect_metadata_for_union function was dropping all metadata because intersecting anything with an empty set yields an empty set. After optimize_projections prunes unused columns and recompute_schema rebuilds the Union via Union::try_new, the logical schema ends up with empty metadata while the physical schema retains the original field metadata from Parquet, causing a physical/logical schema mismatch error. The fix treats empty metadata as a non-vote in the intersection: branches with no metadata (NULL literals, computed expressions) are skipped, so only branches with actual metadata participate. When non-empty branches conflict, their metadata is still correctly intersected as before. Closes apache#19049
Contributor
|
Could we add an SLT reproducer? |
Add a regression test to metadata.slt that exercises the optimize_projections column pruning path on a UNION ALL with NULL literals and a table with field metadata.
…b.com/RafaelHerrero/datafusion into fix/union-metadata-intersection-19049
Author
|
Added an SLT reproducer in metadata.slt. The test uses table_with_metadata (which has field-level metadata) in a UNION ALL with NULL literals, and includes an unused column (id) so that optimize_projections prunes it — triggering the recompute_schema → intersect_metadata_for_union path that was dropping metadata |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Which issue does this PR close?
Rationale for this change
We're building a SQL engine on top of DataFusion and hit this while running benchmarks. A
UNION ALLquery against Parquet files that carry field metadata (likePARQUET:field_idor InfluxDB'siox::column::type). When one branch of the union has a NULL literal,intersect_metadata_for_unionintersects the metadata from the data source with the empty metadata from the NULL — and since intersecting anything with an empty set gives empty, all metadata gets wiped out.Later, when
optimize_projectionsprunes columns andrecompute_schemarebuilds the Union schema, the logical schema has{}while the physical schema still has the original metadata from Parquet. This causes:As @erratic-pattern and @alamb discussed in the issue, empty metadata from NULL literals isn't saying "this field has no metadata" — it's saying "I don't know." It shouldn't erase metadata from branches that actually have it.
I fixed this in
intersect_metadata_for_uniondirectly rather than patchingoptimize_projectionsorrecompute_schema, since that's where the bad intersection happens and it covers all code paths that derive Union schemas.What changes are included in this PR?
One change to
intersect_metadata_for_unionindatafusion/expr/src/expr.rs: branches with empty metadata are skipped during intersection instead of participating. Non-empty branches still intersect normally (conflicting values still get dropped). If every branch is empty, the result is empty — same as before.Are these changes tested?
Added 7 unit tests for
intersect_metadata_for_union:The full end-to-end reproduction needs Parquet files with field metadata as described in the issue. The unit tests cover the intersection logic directly.
Are there any user-facing changes?
No API changes.
UNION ALLqueries combining metadata-carrying sources with NULL literals will stop failing with schema mismatch errors.