[AURON #1898] Fix NULL handling for NOT IN subquery #1899

Merged
cxzl25 merged 1 commit into apache:master from yew1eb:fix_null_not_in_subquery
Jan 16, 2026

@yew1eb yew1eb commented Jan 15, 2026

Which issue does this PR close?

Closes #1898

Rationale for this change

What changes are included in this PR?

Are there any user-facing changes?

How was this patch tested?

@yew1eb yew1eb force-pushed the fix_null_not_in_subquery branch from 1303630 to bb9f600 Compare January 15, 2026 19:14

yew1eb commented Jan 16, 2026

@cxzl25 PTAL

Copilot AI left a comment

Pull request overview

This pull request aims to fix NULL handling for NOT IN subquery operations (issue #1898) by implementing proper null-aware anti-join semantics. The changes ensure that when a NOT IN subquery contains NULL values, the query returns the correct (empty) result set as per SQL semantics.

Changes:

  • Added logic in the Rust semi-join implementation to detect NULL keys in the build side and filter probe rows accordingly
  • Added a test case to verify NOT IN subquery behavior with NULL values
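
For readers unfamiliar with why a NULL on the subquery side empties the result, the following is a minimal sketch of SQL's three-valued `NOT IN` semantics. It is not the code from this PR; `Tri`, `not_in`, and `filter_not_in` are hypothetical names, and `None` stands in for NULL.

```rust
/// SQL three-valued truth value (illustrative type, not part of Auron).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Tri {
    True,
    False,
    Unknown,
}

/// Evaluates `probe NOT IN (build...)` under SQL NULL semantics.
/// `None` models NULL on either side.
fn not_in(probe: Option<i32>, build: &[Option<i32>]) -> Tri {
    // An empty subquery makes NOT IN true, even for a NULL probe value.
    if build.is_empty() {
        return Tri::True;
    }
    // A NULL probe value compared against a non-empty set is UNKNOWN.
    let p = match probe {
        Some(p) => p,
        None => return Tri::Unknown,
    };
    // A definite match makes NOT IN false.
    if build.contains(&Some(p)) {
        return Tri::False;
    }
    // No match, but a NULL in the set means "might match": UNKNOWN.
    if build.contains(&None) {
        return Tri::Unknown;
    }
    Tri::True
}

/// A WHERE clause keeps only rows whose predicate is TRUE,
/// so UNKNOWN rows are dropped just like FALSE rows.
fn filter_not_in(probe: &[Option<i32>], build: &[Option<i32>]) -> Vec<Option<i32>> {
    probe
        .iter()
        .copied()
        .filter(|p| not_in(*p, build) == Tri::True)
        .collect()
}
```

With probe values 1, 2, 3 and a build side of 1, 2, NULL, every row evaluates to FALSE or UNKNOWN, so `filter_not_in` returns an empty vector.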

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.

| File | Description |
| --- | --- |
| native-engine/datafusion-ext-plans/src/joins/bhj/semi_join.rs | Implements null-aware anti-join logic by checking for NULL keys in build side and filtering probe rows when NULL keys are present |
| spark-extension-shims-spark/src/test/scala/org/apache/auron/AuronQuerySuite.scala | Adds test case for NOT IN subquery with NULL values to verify correct null-aware anti-join behavior |
Comments suppressed due to low confidence (1)

native-engine/datafusion-ext-plans/src/joins/bhj/semi_join.rs:210

  • There's a potential index synchronization issue. When a probe row has a valid key (key_is_valid is true) but build_has_null_keys is also true, the code continues at line 209 without incrementing hashes_idx. However, this valid probe key would have been included in map_values during the lookup phase (lines 175-186). This causes hashes_idx to become out of sync with map_values, potentially leading to incorrect lookups or index out of bounds errors in subsequent iterations.

The fix should ensure that when key_is_valid is true and we early exit due to build_has_null_keys, we still increment hashes_idx. One approach is to move the hashes_idx increment before the probe key lookup, or restructure the logic to handle this case correctly.

            if P.mode == Anti
                && P.probe_is_join_side
                && (!key_is_valid || build_has_null_keys) // Filter if probe row is NULL or build side has any NULL
                && self.join_params.is_null_aware_anti_join
            {
                probed_joined.set(row_idx, true);
                continue;
            }
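
The synchronization concern can be illustrated with a toy model. This is a sketch under assumptions, not the code in semi_join.rs: `matched_rows`, `lookups`, and `skip_row` are hypothetical names, `lookups` is assumed to hold one precomputed result per non-NULL probe key in row order, and `cursor` plays the role of `hashes_idx`.

```rust
/// Toy model of a probe loop that consumes precomputed lookup results.
/// `lookups` has one entry per valid (non-NULL) key, in row order.
/// `skip_row[i]` marks rows that take an early exit.
fn matched_rows(keys: &[Option<i32>], lookups: &[bool], skip_row: &[bool]) -> Vec<bool> {
    let mut cursor = 0; // index into `lookups`, analogous to hashes_idx
    let mut out = Vec::with_capacity(keys.len());
    for (row, key) in keys.iter().enumerate() {
        // Consume this row's lookup slot before any early exit, so the
        // cursor advances for every valid key whether or not the row is skipped.
        let hit = match key {
            Some(_) => {
                let h = lookups[cursor];
                cursor += 1;
                h
            }
            None => false, // NULL keys were never looked up
        };
        if skip_row[row] {
            out.push(false);
            continue;
        }
        out.push(hit);
    }
    out
}
```

In this model, if the early exit ran before the cursor advanced, a skipped row with a valid key would leave its slot unconsumed and the next valid row would read the wrong entry. Whether that can be observed in the real operator depends on whether any row gets past the early exit once `build_has_null_keys` is true.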

cxzl25 commented Jan 16, 2026

@yew1eb Nice catch. Thank you for your contribution!

Seq(row((1, 1)), row((2, 2)), row((3, null)))
.toDF("a", "b")
.createOrReplaceTempView("tbl")
val df = checkSparkAnswer("select * from tbl where a not in (select b from tbl)")

  == Results ==
  !== Correct Answer - 0 ==   == Spark Answer - 1 ==
  !struct<>                   struct<a:int,b:int>
  !                           [3,null] (QueryTest.scala:244)
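
The mismatch above, where `[3,null]` is returned although the correct answer has no rows, is what a plain hash anti-join produces when it ignores NULL build keys. The sketch below reproduces that on the same three rows; it is an illustration only, with hypothetical names (`Row`, `naive_anti_join`, `null_aware_anti_join`), and it assumes column `a` is never NULL, as in the test data.

```rust
use std::collections::HashSet;

/// One row of `tbl`: (a, b), where `b` may be NULL (modeled as None).
type Row = (i32, Option<i32>);

/// Plain anti-join: NULL build keys are dropped from the hash set,
/// so a probe value with no match is kept.
fn naive_anti_join(rows: &[Row]) -> Vec<Row> {
    let build: HashSet<i32> = rows.iter().filter_map(|r| r.1).collect();
    rows.iter()
        .copied()
        .filter(|r| !build.contains(&r.0))
        .collect()
}

/// Null-aware anti-join: any NULL on the build side makes `NOT IN`
/// non-TRUE for every probe row, so nothing is returned.
fn null_aware_anti_join(rows: &[Row]) -> Vec<Row> {
    if rows.iter().any(|r| r.1.is_none()) {
        return Vec::new();
    }
    naive_anti_join(rows)
}
```

For the rows (1, 1), (2, 2), (3, NULL), the naive version keeps (3, NULL) while the null-aware version returns no rows.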

@cxzl25 cxzl25 merged commit db984c1 into apache:master Jan 16, 2026
104 checks passed