[AURON #1898] Fix NULL handling for NOT IN subquery #1899
cxzl25 merged 1 commit into apache:master
@cxzl25 PTAL
Pull request overview
This pull request aims to fix NULL handling for NOT IN subquery operations (issue #1898) by implementing proper null-aware anti-join semantics. The changes ensure that when a NOT IN subquery contains NULL values, the query returns the correct (empty) result set as per SQL semantics.
Changes:
- Added logic in the Rust semi-join implementation to detect NULL keys in the build side and filter probe rows accordingly
- Added a test case to verify NOT IN subquery behavior with NULL values
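The null-aware semantics described above can be sketched in isolation. This is a minimal model, not the code from semi_join.rs: the function name `null_aware_anti_join` and the use of `Option<i32>` to stand in for nullable keys are assumptions made for illustration. It follows standard SQL three-valued logic, where `a NOT IN (subquery)` keeps a row only when the predicate is TRUE.

```rust
/// Null-aware anti join for `probe NOT IN (build)`; `None` models SQL NULL.
/// Hypothetical helper for illustration, not part of the Auron code base.
fn null_aware_anti_join(probe: &[Option<i32>], build: &[Option<i32>]) -> Vec<Option<i32>> {
    // An empty subquery makes NOT IN true for every probe row, NULL keys included.
    if build.is_empty() {
        return probe.to_vec();
    }
    // A NULL on the build side makes every comparison FALSE or UNKNOWN, never TRUE,
    // so no probe row survives.
    if build.iter().any(|k| k.is_none()) {
        return Vec::new();
    }
    probe
        .iter()
        .copied()
        // A NULL probe key compares as UNKNOWN, so that row is filtered out too.
        .filter(|k| k.is_some() && !build.contains(k))
        .collect()
}
```

With the data from the test case in this PR (a = 1, 2, 3 and b = 1, 2, NULL), the build side contains a NULL, so the sketch returns no rows, which matches the expected empty answer.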
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
| native-engine/datafusion-ext-plans/src/joins/bhj/semi_join.rs | Implements null-aware anti-join logic by checking for NULL keys in build side and filtering probe rows when NULL keys are present |
| spark-extension-shims-spark/src/test/scala/org/apache/auron/AuronQuerySuite.scala | Adds test case for NOT IN subquery with NULL values to verify correct null-aware anti-join behavior |
Comments suppressed due to low confidence (1)
native-engine/datafusion-ext-plans/src/joins/bhj/semi_join.rs:210
- There's a potential index synchronization issue. When a probe row has a valid key (key_is_valid is true) but build_has_null_keys is also true, the code continues at line 209 without incrementing hashes_idx. However, this valid probe key would have been included in map_values during the lookup phase (lines 175-186). This causes hashes_idx to become out of sync with map_values, potentially leading to incorrect lookups or index out of bounds errors in subsequent iterations.
The fix should ensure that when key_is_valid is true and we early exit due to build_has_null_keys, we still increment hashes_idx. One approach is to move the hashes_idx increment before the probe key lookup, or restructure the logic to handle this case correctly.
if P.mode == Anti
    && P.probe_is_join_side
    && (!key_is_valid || build_has_null_keys) // Filter if probe row is NULL or build side has any NULL
    && self.join_params.is_null_aware_anti_join
{
    probed_joined.set(row_idx, true);
    continue;
}
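The suppressed comment was low confidence, so whether the real loop has this problem is not confirmed here. If it does, one way to keep the cursor aligned is to consume the lookup slot for every valid key before any early exit. The sketch below is a simplified stand-in under that assumption: `anti_join_keep_mask`, `lookups`, `cursor`, and the `keep` mask are hypothetical names, and `lookups` plays the role of a compacted result that holds one entry per valid probe key.

```rust
/// Hypothetical stand-in for the probe loop. `lookups` holds one entry per
/// valid (non-NULL) probe key, in row order; `true` means the key was found
/// on the build side.
fn anti_join_keep_mask(
    key_is_valid: &[bool],
    lookups: &[bool],
    build_has_null_keys: bool,
) -> Vec<bool> {
    let mut keep = vec![false; key_is_valid.len()];
    let mut cursor = 0; // index into the compacted `lookups` vector
    for (row_idx, &valid) in key_is_valid.iter().enumerate() {
        // Consume this row's lookup slot before any early exit, so the
        // cursor stays aligned with `lookups` for the following rows.
        let found = if valid {
            let f = lookups[cursor];
            cursor += 1;
            Some(f)
        } else {
            None
        };
        // Null-aware anti join: drop the row if its key is NULL or the
        // build side contains a NULL key.
        if !valid || build_has_null_keys {
            continue;
        }
        // Keep the row only when its key was looked up and not found.
        keep[row_idx] = found == Some(false);
    }
    // Every lookup slot must have been consumed exactly once.
    debug_assert_eq!(cursor, lookups.len());
    keep
}
```

Advancing the cursor inside the early-exit branch instead would also work; doing it up front keeps the invariant in one place.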
@yew1eb Nice catch, thank you for your contribution!
Seq(row((1, 1)), row((2, 2)), row((3, null)))
  .toDF("a", "b")
  .createOrReplaceTempView("tbl")
val df = checkSparkAnswer("select * from tbl where a not in (select b from tbl)")
== Results ==
!== Correct Answer - 0 == == Spark Answer - 1 ==
!struct<> struct<a:int,b:int>
! [3,null] (QueryTest.scala:244)
Which issue does this PR close?
Closes #1898
Rationale for this change
What changes are included in this PR?
Are there any user-facing changes?
How was this patch tested?