[AURON #1646] isNan semantics are aligned with Spark#1647
[AURON #1646] isNan semantics are aligned with Spark#1647zuston merged 12 commits intoapache:masterfrom
isNan semantics are aligned with Spark#1647Conversation
This reverts commit 2312449.
isNan semantics are aligned with Spark
|
@cxzl25 Could you take a look as well? Thanks! |
There was a problem hiding this comment.
Pull request overview
This PR aligns the isNaN function implementation with Spark semantics, addressing issue #1646. The key semantic change is that isnan(null) now returns false instead of propagating null, matching Spark's behavior. The implementation switches from the built-in DataFusion IsNaN function to a custom Spark-compatible implementation.
Key changes:
- Implemented custom
Spark_IsNaNfunction that returnsfalsefor null values instead of propagating nulls - Added
nulls_to_falsehelper utility to convert null values in boolean arrays to false - Updated tests to verify correct null handling and removed computed NaN test case (
log(-3)) due to Parquet round-trip consistency issues
Reviewed changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated 2 comments.
Show a summary per file
| File | Description |
|---|---|
| spark-extension/src/main/scala/org/apache/spark/sql/auron/NativeConverters.scala | Changed IsNaN registration from built-in function to custom Spark extension |
| spark-extension-shims-spark/src/test/scala/org.apache.auron/AuronFunctionSuite.scala | Enabled IsNaN test (previously ignored), removed computed NaN test case, added more null value test cases |
| native-engine/datafusion-ext-functions/src/spark_isnan.rs | New Spark-compatible IsNaN implementation with correct null semantics (null→false) |
| native-engine/datafusion-ext-functions/src/lib.rs | Registered Spark_IsNaN function and updated error message for consistency |
| native-engine/datafusion-ext-commons/src/arrow/mod.rs | Added boolean module |
| native-engine/datafusion-ext-commons/src/arrow/boolean.rs | New helper utility to convert nulls to false in boolean arrays |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| _other => { | ||
| // For non-float arrays, Spark's isnan is effectively false. | ||
| let len = array.len(); | ||
| let out = ScalarValue::Boolean(Some(false)).to_array_of_size(len)?; | ||
| Ok(ColumnarValue::Array(out)) | ||
| } |
There was a problem hiding this comment.
The code handles non-float types (lines 44-49) by returning an array of all false values, but there are no unit tests covering this behavior. Consider adding test cases for non-float types (e.g., integers, strings) to ensure isnan correctly returns false for these types, matching Spark's semantics.
| match sv { | ||
| ScalarValue::Float64(a) => a.map(|x| x.is_nan()).unwrap_or(false), | ||
| ScalarValue::Float32(a) => a.map(|x| x.is_nan()).unwrap_or(false), | ||
| _ => false, |
There was a problem hiding this comment.
The code handles non-float scalar types (line 55) by returning false, but there are no unit tests covering this behavior. Consider adding test cases for non-float scalar types (e.g., integers, strings) to ensure isnan correctly returns false for these types, matching Spark's semantics.
There was a problem hiding this comment.
lgtm.
Merged @ShreyeshArangath thanks for your nice contribution!
Which issue does this PR close?
Closes #1646
Rationale for this change
isnan currently uses math::isnan which propagates nulls and can mismatch Spark semantics. In Spark, isnan(null) must return false, not null. Additionally, computed NaN values like log(-3) may not be handled consistently during Parquet round-trip.
What changes are included in this PR?
new implementation for is_nan native function made for spark
Are there any user-facing changes?
How was this patch tested?