[AURON #1656] Support to disable the scan timestamp for Parquet and ORC formats #1657

Merged
richox merged 8 commits into apache:master from cxzl25:support_timestamp_config
Nov 27, 2025

@cxzl25 cxzl25 commented Nov 23, 2025

Which issue does this PR close?

Closes #1656

Rationale for this change

Spark 3 uses the proleptic Gregorian calendar instead of the hybrid Julian + Gregorian calendar.

https://issues.apache.org/jira/browse/SPARK-26651

Rust's chrono library supports the proleptic Gregorian calendar.

However, for timestamps that have to be rebased from the Julian to the Gregorian calendar, Spark and Auron may return inconsistent results. There is no ready-made implementation of this conversion in Rust.

https://github.com/apache/spark/blob/3c50fda6f29d95c24b664a32ee41c61f0a19eedb/sql/api/src/main/scala/org/apache/spark/sql/catalyst/util/RebaseDateTime.scala#L525

create table t1_parquet (c1 timestamp) stored as parquet;
set spark.sql.parquet.int96RebaseModeInWrite=LEGACY;
insert overwrite t1_parquet values (timestamp '0001-01-01 00:00:00');
select * from t1_parquet;

Spark

0001-01-01 00:00:00

Auron

0000-12-30 00:05:43
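
The two-day shift in the Auron result above is consistent with a value written on the hybrid Julian + Gregorian calendar being read as if it were proleptic Gregorian. The following is a minimal sketch using only the JDK, not Spark's RebaseDateTime or Auron's code: `java.util.GregorianCalendar` models the hybrid calendar and `java.time` the proleptic one. The object and method names are hypothetical, and the sketch works in UTC, so it reproduces only the date shift, not the 00:05:43 time-of-day offset.

```scala
import java.time.{Instant, LocalDate, ZoneOffset}
import java.util.{GregorianCalendar, TimeZone}

// Hypothetical helper, for illustration only.
object RebaseSketch {
  // Interprets (year, month, day) on the hybrid Julian + Gregorian calendar
  // and returns the same instant as a proleptic Gregorian date.
  def hybridToProleptic(year: Int, month: Int, day: Int): LocalDate = {
    val hybrid = new GregorianCalendar(TimeZone.getTimeZone("UTC"))
    hybrid.clear()
    // Calendar months are zero-based.
    hybrid.set(year, month - 1, day, 0, 0, 0)
    Instant.ofEpochMilli(hybrid.getTimeInMillis).atZone(ZoneOffset.UTC).toLocalDate
  }

  def main(args: Array[String]): Unit = {
    println(hybridToProleptic(1, 1, 1))    // 0000-12-30
    println(hybridToProleptic(2000, 1, 1)) // 2000-01-01
  }
}
```

Dates after the 1582 Gregorian cutover are identical on both calendars, which is why only old timestamps are affected.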

With the ORC format, reading fails with an overflow error instead (see #1638).

What changes are included in this PR?

This PR introduces two configurations, both defaulting to true.

When one of them is set to false and the scan schema contains a timestamp type, the scan falls back to the Spark implementation.

spark.auron.enable.scan.parquet.timestamp=true
spark.auron.enable.scan.orc.timestamp=true
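
As a usage sketch, the two keys could also be set from a Scala application through the runtime configuration, which is the programmatic counterpart of the SQL `set` command used in the tests below. The assumptions here are that the Auron extension is already installed and enabled for the session and that the table `t1_parquet` from the example above is visible to it; the object name and application name are hypothetical.

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical driver program, for illustration only.
object DisableTimestampScan {
  def main(args: Array[String]): Unit = {
    // Assumption: Auron is already configured for this session.
    val spark = SparkSession.builder().appName("auron-timestamp-fallback").getOrCreate()

    // Disable native timestamp scans so that such scans fall back to Spark.
    spark.conf.set("spark.auron.enable.scan.parquet.timestamp", "false")
    spark.conf.set("spark.auron.enable.scan.orc.timestamp", "false")

    // Assumption: t1_parquet exists and is visible to this session.
    spark.sql("select * from t1_parquet").show(truncate = false)
    spark.stop()
  }
}
```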

Are there any user-facing changes?

How was this patch tested?

Auron

set spark.auron.enable.scan.parquet.timestamp=false;
0001-01-01 00:00:00

nested type test

create table t2_parquet (c1 string,c2 struct<c3:timestamp>) stored as parquet;
insert overwrite t2_parquet values (timestamp '0001-01-01 00:00:00',named_struct('c3',timestamp '0001-01-01 00:00:00'));
select * from t2_parquet;

Auron

0001-01-01 00:00:00	{"c3":0000-12-30 00:05:43}

Spark

25/11/24 12:17:10 WARN AuronConverters: Falling back exec: FileSourceScanExec: assertion failed: Parquet scan with timestamp type is not supported
0001-01-01 00:00:00	{"c3":0001-01-01 00:00:00}

@github-actions github-actions bot added the spark label Nov 23, 2025

Copilot AI left a comment

Pull request overview

This PR adds configuration options to disable timestamp scanning for Parquet and ORC file formats to address calendar conversion issues between Auron and Spark 3. Spark 3 uses the proleptic Gregorian calendar, while historical timestamps may require Julian to Gregorian conversion that Auron doesn't implement, causing inconsistent results.

  • Introduces two new configuration flags: spark.auron.enable.scan.parquet.timestamp and spark.auron.enable.scan.orc.timestamp (both default to true)
  • Adds existTimestampType helper function to recursively detect timestamp types in complex schemas
  • Updates scan conversion logic to fall back to Spark when timestamp scanning is disabled
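
The recursive check described in the second bullet can be sketched as follows. This is not the code from NativeConverters.scala: the data type hierarchy below is a set of hypothetical stand-ins for Spark's `DataType` classes so that the sketch runs without Spark, and only the function name `existTimestampType` is taken from the PR.

```scala
// Hypothetical stand-ins for Spark's data types, for illustration only.
object TimestampSchemaSketch {
  sealed trait DataType
  case object StringType extends DataType
  case object TimestampType extends DataType
  final case class ArrayType(elementType: DataType) extends DataType
  final case class MapType(keyType: DataType, valueType: DataType) extends DataType
  final case class StructField(name: String, dataType: DataType)
  final case class StructType(fields: Seq[StructField]) extends DataType

  // Returns true if the type is a timestamp or contains one at any nesting level.
  def existTimestampType(dataType: DataType): Boolean = dataType match {
    case TimestampType => true
    case ArrayType(elementType) => existTimestampType(elementType)
    case MapType(keyType, valueType) =>
      existTimestampType(keyType) || existTimestampType(valueType)
    case StructType(fields) => fields.exists(f => existTimestampType(f.dataType))
    case _ => false
  }

  // Schema of t2_parquet from the PR description: (c1 string, c2 struct<c3:timestamp>).
  val t2Schema: StructType = StructType(Seq(
    StructField("c1", StringType),
    StructField("c2", StructType(Seq(StructField("c3", TimestampType))))))

  def main(args: Array[String]): Unit = {
    println(existTimestampType(t2Schema))   // true
    println(existTimestampType(StringType)) // false
  }
}
```

Checking nested types matters because, as the `t2_parquet` test in the description shows, a timestamp inside a struct is affected in the same way as a top-level column.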

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| spark-extension/src/main/scala/org/apache/spark/sql/auron/NativeConverters.scala | Adds existTimestampType function to recursively check for timestamp types in schemas (Arrays, Maps, Structs) |
| spark-extension/src/main/scala/org/apache/spark/sql/auron/AuronConverters.scala | Adds configuration properties for timestamp scanning control and validation logic to trigger fallback when timestamps are detected with scanning disabled |


cxzl25 and others added 2 commits November 24, 2025 12:42
…veConverters.scala

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
…veConverters.scala

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

@cxzl25 cxzl25 requested a review from richox November 26, 2025 09:06
cxzl25 and others added 3 commits November 27, 2025 16:12
…nConverters.scala

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
…nConverters.scala

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
if (!enableScanParquetTimestamp) {
  assert(
    !exec.requiredSchema.exists(e => existTimestampType(e.dataType)),
    s"Parquet scan with timestamp type is not supported for table: ${tableIdentifier
@cxzl25 (Contributor, Author) commented:

spark-sql (default)> set spark.auron.enable.scan.parquet.timestamp=false;
spark.auron.enable.scan.parquet.timestamp	false
Time taken: 0.054 seconds, Fetched 1 row(s)
spark-sql (default)> select * from t3_parquet ;
25/11/27 16:34:36 WARN AuronConverters: Falling back exec: FileSourceScanExec: assertion failed: Parquet scan with timestamp type is not supported for table: `spark_catalog`.`default`.`t3_parquet`. Set spark.auron.enable.scan.parquet.timestamp=true to enable timestamp support or remove timestamp columns from the query.
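
The log above suggests the following control flow: the assertion fails during conversion, the failure is caught, a warning is logged, and the original Spark operator is kept. Below is a minimal sketch of that pattern under stated assumptions. The plan types, the `convert` function, and the choice to catch `AssertionError` directly are hypothetical and are not taken from AuronConverters.scala; only the flag name and the message wording come from the PR.

```scala
// Hypothetical model of the convert-or-fall-back pattern, for illustration only.
object FallbackSketch {
  sealed trait Plan
  final case class NativeScan(table: String) extends Plan
  final case class SparkScan(table: String) extends Plan

  def convert(
      table: String,
      hasTimestamp: Boolean,
      enableScanParquetTimestamp: Boolean): Plan =
    try {
      if (!enableScanParquetTimestamp) {
        // Scala's assert throws java.lang.AssertionError("assertion failed: <message>").
        assert(
          !hasTimestamp,
          s"Parquet scan with timestamp type is not supported for table: $table")
      }
      NativeScan(table)
    } catch {
      case e: AssertionError =>
        // Keep the Spark operator when the native conversion is rejected.
        println(s"WARN Falling back exec: ${e.getMessage}")
        SparkScan(table)
    }

  def main(args: Array[String]): Unit = {
    println(convert("t3_parquet", hasTimestamp = true, enableScanParquetTimestamp = false))
    println(convert("t3_parquet", hasTimestamp = true, enableScanParquetTimestamp = true))
  }
}
```

With the flag left at its default of true, the assertion is skipped and the native scan is used even when the schema contains timestamps.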

@richox richox merged commit 059516d into apache:master Nov 27, 2025
98 checks passed