Fix unionByName to properly handle missing columns from both DataFrames #243
When allowMissingColumns=True, the method now correctly handles missing columns from both the left and right DataFrames by:
- Adding missing columns from the right DataFrame to the left as NULL
- Ensuring all columns from the left DataFrame are present in the right
- Properly aligning column order to match Spark's behavior

This ensures the union result contains all columns from both DataFrames, with NULL values where columns are missing, matching PySpark behavior.
Pull request overview
This PR fixes the unionByName method to properly handle missing columns from both DataFrames when allowMissingColumns=True. Previously, the method only handled missing columns from the right DataFrame, but not from the left one.
Key Changes:
- Updated the logic to add NULL columns for missing columns from both DataFrames
- Column order now matches Spark's behavior by prioritizing the left DataFrame's schema
- Added a test case to verify the reversed scenario works correctly
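The alignment rule described in these changes can be sketched in plain Python, independent of DuckDB. This is an illustration only, not the code from the PR: the function name `union_by_name`, its parameters, and the representation of rows as tuples are all assumptions made for the example. It assumes the output schema is the left DataFrame's columns followed by any right-only columns, with `None` standing in for NULL.

```python
from __future__ import annotations

from typing import Any, Sequence


def union_by_name(
    left_cols: Sequence[str],
    left_rows: Sequence[tuple[Any, ...]],
    right_cols: Sequence[str],
    right_rows: Sequence[tuple[Any, ...]],
    allow_missing_columns: bool = False,
) -> tuple[list[str], list[tuple[Any, ...]]]:
    """Hypothetical helper: union two row sets by column name."""
    left_set = set(left_cols)
    if left_set != set(right_cols) and not allow_missing_columns:
        raise ValueError("column sets differ; pass allow_missing_columns=True")

    # Output schema: left columns first, then right-only columns in right order.
    out_cols = list(left_cols) + [c for c in right_cols if c not in left_set]

    def align(cols: Sequence[str], rows: Sequence[tuple[Any, ...]]):
        # Map each column name to its position, then rebuild every row in
        # out_cols order, using None (NULL) for columns this side lacks.
        index = {name: i for i, name in enumerate(cols)}
        return [
            tuple(row[index[c]] if c in index else None for c in out_cols)
            for row in rows
        ]

    return out_cols, align(left_cols, left_rows) + align(right_cols, right_rows)


# Reversed scenario: the side with fewer columns is on the left.
cols, rows = union_by_name(
    ["a"], [(1,)], ["a", "b"], [(2, 3)], allow_missing_columns=True
)
print(cols, rows)  # ['a', 'b'] [(1, None), (2, 3)]
```

The same helper covers both directions, since each side is padded against the combined schema rather than against the other side.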
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| duckdb/experimental/spark/sql/dataframe.py | Rewrote the unionByName implementation to handle missing columns bidirectionally and align columns properly before performing the union |
| tests/fast/spark/test_spark_union_by_name.py | Added test case test_union_by_name_allow_missing_cols_rev to verify the fix works when the DataFrame with fewer columns is on the left side |
Co-authored-by: Copilot <[email protected]>
Can you fix the linting and formatting errors please? See https://duckdb.org/docs/stable/dev/building/python#3-enable-pre-commit-hooks for guidance.
evertlammerts left a comment
formatting