[AURON #1656] Support to disable the scan timestamp for Parquet and ORC formats #1657
Pull request overview
This PR adds configuration options to disable timestamp scanning for the Parquet and ORC file formats, addressing calendar conversion issues between Auron and Spark 3. Spark 3 uses the proleptic Gregorian calendar, while historical timestamps may require a Julian-to-Gregorian conversion that Auron does not implement, which causes inconsistent results.
- Introduces two new configuration flags: `spark.auron.enable.scan.parquet.timestamp` and `spark.auron.enable.scan.orc.timestamp` (both default to true)
- Adds `existTimestampType` helper function to recursively detect timestamp types in complex schemas
- Updates scan conversion logic to fall back to Spark when timestamp scanning is disabled
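A recursive check of this kind could look roughly like the sketch below. It assumes Spark's `org.apache.spark.sql.types` classes are on the classpath; the wrapper object `TimestampTypeCheck` is a hypothetical name, and the function body is an illustration of the idea, not the code merged in this PR.

```scala
import org.apache.spark.sql.types._

// Hypothetical wrapper object, used here only to make the sketch compile on its own.
object TimestampTypeCheck {

  // Returns true if the type is a timestamp or contains one at any nesting depth.
  def existTimestampType(dataType: DataType): Boolean = dataType match {
    case TimestampType => true
    // Arrays: inspect the element type.
    case ArrayType(elementType, _) => existTimestampType(elementType)
    // Maps: either the key type or the value type may hold a timestamp.
    case MapType(keyType, valueType, _) =>
      existTimestampType(keyType) || existTimestampType(valueType)
    // Structs: inspect every field.
    case StructType(fields) => fields.exists(f => existTimestampType(f.dataType))
    case _ => false
  }
}
```

Applied to a scan, the check would be run over each field of the required schema, so a timestamp buried inside an array of structs is detected the same way as a top-level timestamp column.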
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| spark-extension/src/main/scala/org/apache/spark/sql/auron/NativeConverters.scala | Adds existTimestampType function to recursively check for timestamp types in schemas (Arrays, Maps, Structs) |
| spark-extension/src/main/scala/org/apache/spark/sql/auron/AuronConverters.scala | Adds configuration properties for timestamp scanning control and validation logic to trigger fallback when timestamps are detected with scanning disabled |
    if (!enableScanParquetTimestamp) {
      assert(
        !exec.requiredSchema.exists(e => existTimestampType(e.dataType)),
        s"Parquet scan with timestamp type is not supported for table: ${tableIdentifier
spark-sql (default)> set spark.auron.enable.scan.parquet.timestamp=false;
spark.auron.enable.scan.parquet.timestamp false
Time taken: 0.054 seconds, Fetched 1 row(s)
spark-sql (default)> select * from t3_parquet ;
25/11/27 16:34:36 WARN AuronConverters: Falling back exec: FileSourceScanExec: assertion failed: Parquet scan with timestamp type is not supported for table: `spark_catalog`.`default`.`t3_parquet`. Set spark.auron.enable.scan.parquet.timestamp=true to enable timestamp support or remove timestamp columns from the query.
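The log line above suggests that the failed assertion is caught and turned into a fallback to Spark's own scan. The following toy model illustrates that pattern using only the Scala standard library. `Plan`, `NativeScan`, `SparkScan`, `convertScan`, and `convertOrFallback` are all hypothetical names invented for this sketch; Auron's actual conversion code is not shown in this page and may be structured differently.

```scala
import scala.util.control.NonFatal

// Toy model of "assert during conversion, fall back on failure".
object FallbackSketch {
  sealed trait Plan
  case class NativeScan(table: String) extends Plan
  case class SparkScan(table: String) extends Plan

  // Conversion refuses timestamp columns when timestamp scanning is disabled.
  def convertScan(table: String, hasTimestamp: Boolean, timestampScanEnabled: Boolean): Plan = {
    if (!timestampScanEnabled) {
      assert(
        !hasTimestamp,
        s"Parquet scan with timestamp type is not supported for table: $table")
    }
    NativeScan(table)
  }

  // Any non-fatal failure (including AssertionError) keeps the original Spark scan.
  def convertOrFallback(table: String, hasTimestamp: Boolean, timestampScanEnabled: Boolean): Plan =
    try convertScan(table, hasTimestamp, timestampScanEnabled)
    catch {
      case NonFatal(e) =>
        println(s"Falling back exec: ${e.getMessage}")
        SparkScan(table)
    }
}
```

In this model a query without timestamp columns still takes the native path even when the flag is false; only scans whose schema contains a timestamp fall back.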
Which issue does this PR close?
Closes #1656
Rationale for this change
Spark 3 uses the proleptic Gregorian calendar instead of the hybrid Julian + Gregorian calendar.
https://issues.apache.org/jira/browse/SPARK-26651
Rust's chrono library supports the proleptic Gregorian calendar.
However, for timestamps that require a Julian-to-Gregorian conversion, the results of Spark and Auron may be inconsistent, and there is no ready-made implementation of that conversion in Rust.
https://github.com/apache/spark/blob/3c50fda6f29d95c24b664a32ee41c61f0a19eedb/sql/api/src/main/scala/org/apache/spark/sql/catalyst/util/RebaseDateTime.scala#L525
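To make the discrepancy concrete, the sketch below compares how the same calendar date maps to a day count under the two calendars, using only JDK classes: `java.time.LocalDate` is proleptic Gregorian, while `java.util.GregorianCalendar` uses the hybrid Julian + Gregorian calendar by default. The object and method names are hypothetical, and this is only an illustration of why a rebase step is needed, not Spark's rebase logic.

```scala
import java.time.LocalDate
import java.util.{GregorianCalendar, TimeZone}

object CalendarDiff {
  private val MillisPerDay = 86400000L

  // Days since 1970-01-01 in the proleptic Gregorian calendar.
  def prolepticEpochDay(year: Int, month: Int, day: Int): Long =
    LocalDate.of(year, month, day).toEpochDay

  // Days since 1970-01-01 in the hybrid calendar (Julian before 1582-10-15).
  def hybridEpochDay(year: Int, month: Int, day: Int): Long = {
    val cal = new GregorianCalendar(TimeZone.getTimeZone("UTC"))
    cal.clear()
    cal.set(year, month - 1, day) // Calendar months are zero-based
    Math.floorDiv(cal.getTimeInMillis, MillisPerDay)
  }

  // Offset in days between the two interpretations of the same date.
  def offsetDays(year: Int, month: Int, day: Int): Long =
    hybridEpochDay(year, month, day) - prolepticEpochDay(year, month, day)
}
```

For modern dates the offset is zero, while for 1582-10-04, the last Julian day before the cutover, the two interpretations are ten days apart. A reader that ignores this offset will return shifted values for old timestamps written with the hybrid calendar.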
Screenshots (not preserved): Spark result, Auron result.
With the ORC format, reading fails with an overflow error (see #1638).
What changes are included in this PR?
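This PR introduces two configurations, `spark.auron.enable.scan.parquet.timestamp` and `spark.auron.enable.scan.orc.timestamp`, both defaulting to true.
When one of them is set to false and the scan schema contains a timestamp type, the scan falls back to the Spark implementation.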
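As a usage sketch, the two flags could be turned off for a session from Scala as shown below. This assumes that setting them through `spark.conf.set` behaves like the `set` command used in the spark-sql session in this PR; the object and method names are hypothetical.

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical helper that disables native timestamp scans for both formats.
object DisableTimestampScan {
  val ParquetKey = "spark.auron.enable.scan.parquet.timestamp"
  val OrcKey = "spark.auron.enable.scan.orc.timestamp"

  def apply(spark: SparkSession): Unit = {
    // With both flags false, scans whose schema contains a timestamp
    // are expected to fall back to Spark's own readers.
    spark.conf.set(ParquetKey, "false")
    spark.conf.set(OrcKey, "false")
  }
}
```

Leaving the flags at their default of true keeps the previous behavior, so the change is opt-in for users who hit the calendar or overflow issues.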
Are there any user-facing changes?
How was this patch tested?
Screenshots (not preserved): Auron result; nested type test with Auron and Spark results.