@mariotaddeucci

When allowMissingColumns=True, the method now correctly handles missing columns from both the left and right DataFrames by:

  • Adding columns that exist only in the right DataFrame to the left as NULL
  • Adding columns that exist only in the left DataFrame to the right as NULL
  • Aligning column order to match Spark's behavior

This ensures the union result contains all columns from both DataFrames, with NULL values where columns are missing, matching PySpark behavior.
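
The alignment described above can be sketched in plain Python. This is only an illustration, not the code in dataframe.py: the function name `union_by_name_allow_missing` and the tuple-based row representation are hypothetical, and the column order (left columns first, then right-only columns in the right DataFrame's order) is an assumption based on Spark's documented behavior.

```python
from typing import Any


def union_by_name_allow_missing(
    left_cols: list[str],
    left_rows: list[tuple[Any, ...]],
    right_cols: list[str],
    right_rows: list[tuple[Any, ...]],
) -> tuple[list[str], list[tuple[Any, ...]]]:
    """Hypothetical helper mimicking unionByName(allowMissingColumns=True)."""
    # Result schema: every left column, then columns found only on the right.
    out_cols = list(left_cols) + [c for c in right_cols if c not in left_cols]

    def project(cols: list[str], rows: list[tuple[Any, ...]]) -> list[tuple[Any, ...]]:
        # Map each column name to its position so values can be looked up by name.
        index = {name: i for i, name in enumerate(cols)}
        # Columns missing from this side are filled with None (NULL).
        return [
            tuple(row[index[c]] if c in index else None for c in out_cols)
            for row in rows
        ]

    return out_cols, project(left_cols, left_rows) + project(right_cols, right_rows)


# Left has fewer columns than right (the reversed scenario).
cols, rows = union_by_name_allow_missing(["a"], [(1,)], ["a", "b"], [(2, 3)])
print(cols)  # ['a', 'b']
print(rows)  # [(1, None), (2, 3)]
```

Both sides are projected onto the same final column list before being concatenated, so neither DataFrame needs to be a superset of the other.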

Copilot AI review requested due to automatic review settings January 2, 2026 23:27

Copilot AI left a comment

Pull request overview

This PR fixes the unionByName method to properly handle missing columns from both DataFrames when allowMissingColumns=True. Previously, the method only handled missing columns from the right DataFrame, but not from the left one.

Key Changes:

  • Updated the logic to add NULL columns for missing columns from both DataFrames
  • Column order now matches Spark's behavior by prioritizing the left DataFrame's schema
  • Added a test case to verify the reversed scenario works correctly

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| duckdb/experimental/spark/sql/dataframe.py | Rewrote the unionByName implementation to handle missing columns bidirectionally and align columns properly before performing the union |
| tests/fast/spark/test_spark_union_by_name.py | Added test case test_union_by_name_allow_missing_cols_rev to verify the fix works when the DataFrame with fewer columns is on the left side |

@evertlammerts
Collaborator

Can you fix the linting and formatting errors please? See https://duckdb.org/docs/stable/dev/building/python#3-enable-pre-commit-hooks for guidance.

@evertlammerts evertlammerts left a comment

formatting
