fix: skip empty metadata in intersect_metadata_for_union to prevent s…#21127

Open
RafaelHerrero wants to merge 4 commits into apache:main from RafaelHerrero:fix/union-metadata-intersection-19049

@RafaelHerrero

Which issue does this PR close?

Closes #19049.

Rationale for this change

We're building a SQL engine on top of DataFusion and hit this while running benchmarks with a UNION ALL query against Parquet files that carry field metadata (like PARQUET:field_id or InfluxDB's iox::column::type). When one branch of the union has a NULL literal, intersect_metadata_for_union intersects the metadata from the data source with the empty metadata from the NULL, and since intersecting anything with an empty set gives empty, all metadata gets wiped out.

Later, when optimize_projections prunes columns and recompute_schema rebuilds the Union schema, the logical schema has {} while the physical schema still has the original metadata from Parquet. This causes:

Internal error: Physical input schema should be the same as the one
converted from logical input schema.
Differences:
  - field metadata at index 0 [usage_idle]: (physical) {"iox::column::type": "..."} vs (logical) {}

As @erratic-pattern and @alamb discussed in the issue, empty metadata from NULL literals isn't saying "this field has no metadata" — it's saying "I don't know." It shouldn't erase metadata from branches that actually have it.
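
To make the failure mode concrete, here is a minimal sketch. It assumes the pre-fix logic is a plain key/value intersection in which every branch participates. The names `naive_intersect` and `union_with_null_branch` and the `<elided>` metadata value are placeholders for illustration, not DataFusion code.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for the pre-fix behavior: every branch participates,
/// including branches whose metadata map is empty.
fn naive_intersect<'a>(
    metadatas: impl IntoIterator<Item = &'a HashMap<String, String>>,
) -> HashMap<String, String> {
    let mut iter = metadatas.into_iter();
    let Some(first) = iter.next() else {
        return HashMap::new();
    };
    let mut intersected = first.clone();
    for metadata in iter {
        // Keep only keys present in both maps with equal values.
        intersected.retain(|k, v| metadata.get(k) == Some(&*v));
    }
    intersected
}

/// Two-branch scenario: one branch reads a column with field metadata,
/// the other is a NULL literal with no metadata at all.
fn union_with_null_branch() -> HashMap<String, String> {
    // "<elided>" is a placeholder; the real value is not shown in the error.
    let parquet = HashMap::from([(
        "iox::column::type".to_string(),
        "<elided>".to_string(),
    )]);
    let null_literal: HashMap<String, String> = HashMap::new();
    naive_intersect([&parquet, &null_literal])
}
```

Under these assumptions `union_with_null_branch()` returns an empty map: the NULL branch contributes no keys, so `retain` drops everything the Parquet branch carried.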

I fixed this in intersect_metadata_for_union directly rather than patching optimize_projections or recompute_schema, since that's where the bad intersection happens and it covers all code paths that derive Union schemas.

What changes are included in this PR?

One change to intersect_metadata_for_union in datafusion/expr/src/expr.rs: branches with empty metadata are skipped during intersection instead of participating. Non-empty branches still intersect normally (conflicting values still get dropped). If every branch is empty, the result is empty — same as before.
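
The skip-empty rule described above can be sketched roughly as follows. This is an illustration of the idea, not the diff in this PR: `intersect_skipping_empty` and the `meta` helper are hypothetical names, and the signature is only assumed to resemble the real function.

```rust
use std::collections::HashMap;

/// Sketch of the skip-empty rule: empty maps mean "unknown" and never
/// take part in the intersection.
fn intersect_skipping_empty<'a>(
    metadatas: impl IntoIterator<Item = &'a HashMap<String, String>>,
) -> HashMap<String, String> {
    let mut non_empty = metadatas.into_iter().filter(|m| !m.is_empty());
    let Some(first) = non_empty.next() else {
        // No inputs at all, or every branch had empty metadata.
        return HashMap::new();
    };
    let mut intersected = first.clone();
    for metadata in non_empty {
        // Non-empty branches still intersect: a key survives only if every
        // non-empty branch has it with the same value.
        intersected.retain(|k, v| metadata.get(k) == Some(&*v));
    }
    intersected
}

/// Hypothetical helper for building metadata maps in examples.
fn meta(pairs: &[(&str, &str)]) -> HashMap<String, String> {
    pairs
        .iter()
        .map(|(k, v)| (k.to_string(), v.to_string()))
        .collect()
}
```

With this shape, passing a metadata-carrying branch and an empty branch returns the first branch's metadata regardless of order, while two non-empty branches that disagree on a key's value still drop that key.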

Are these changes tested?

Added 7 unit tests for intersect_metadata_for_union:

  • Same metadata across branches — preserved
  • Conflicting non-empty values — dropped (existing behavior, unchanged)
  • One branch has metadata, other is empty — metadata preserved (the fix)
  • Empty branch comes first — still works
  • All branches empty — empty result
  • Mix of empty and conflicting non-empty — intersects only the non-empty ones
  • No inputs — empty result

The full end-to-end reproduction needs Parquet files with field metadata as described in the issue. The unit tests cover the intersection logic directly.

Are there any user-facing changes?

No API changes. UNION ALL queries combining metadata-carrying sources with NULL literals will stop failing with schema mismatch errors.

RafaelHerrero and others added 2 commits March 23, 2026 00:12
…chema mismatch

When a UNION ALL combines columns from sources with field metadata
(e.g. Parquet) and NULL literals (which have no metadata), the
intersect_metadata_for_union function was dropping all metadata
because intersecting anything with an empty set yields an empty set.

After optimize_projections prunes unused columns and recompute_schema
rebuilds the Union via Union::try_new, the logical schema ends up
with empty metadata while the physical schema retains the original
field metadata from Parquet, causing a physical/logical schema
mismatch error.

The fix treats empty metadata as a non-vote in the intersection:
branches with no metadata (NULL literals, computed expressions) are
skipped, so only branches with actual metadata participate. When
non-empty branches conflict, their metadata is still correctly
intersected as before.

Closes apache#19049
@adriangb
Contributor

Could we add an SLT reproducer?

Add a regression test to metadata.slt that exercises the
optimize_projections column pruning path on a UNION ALL with
NULL literals and a table with field metadata.
@github-actions github-actions bot added the sqllogictest SQL Logic Tests (.slt) label Mar 23, 2026
@RafaelHerrero
Author

Added an SLT reproducer in metadata.slt. The test uses table_with_metadata (which has field-level metadata) in a UNION ALL with NULL literals, and includes an unused column (id) so that optimize_projections prunes it, triggering the recompute_schema → intersect_metadata_for_union path that was dropping metadata.

Labels

logical-expr: Logical plan and expressions
sqllogictest: SQL Logic Tests (.slt)

Development

Successfully merging this pull request may close these issues.

Error with union and optimize_projections: Physical input schema should be the same as the one converted from logical input schema