Skip to content

[AURON #1602] Implement AuronAdaptor SPI discovery with Spark provider#1620

Merged
richox merged 2 commits intoapache:masterfrom
yew1eb:AURON_1602
Nov 18, 2025
Merged

[AURON #1602] Implement AuronAdaptor SPI discovery with Spark provider#1620
richox merged 2 commits intoapache:masterfrom
yew1eb:AURON_1602

Conversation

@yew1eb
Copy link
Contributor

@yew1eb yew1eb commented Nov 11, 2025

Which issue does this PR close?

Closes #1602 .

Rationale for this change

Decouples the initialization of the AuronAdaptor instance from the dependency on the implementation of SparkAuronAdaptor through the SPI mechanism.

What changes are included in this PR?

  • Refactor AuronAdaptor.getInstance() to use double-checked locking + ServiceLoader
  • Add AuronAdaptorProvider SPI interface
  • Implement SparkAuronAdaptorProvider in spark-extension
  • Register provider via META-INF/services in spark-extension

Are there any user-facing changes?

No

How was this patch tested?

Unit tests in AuronAdaptorTest:

  • Verify SPI loads MockAuronAdaptor via META-INF/services.

Copy link
Contributor

@Tartarus0zm Tartarus0zm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for your contribution.
You can add a test to verify whether cases like NativeShuffleExchangeExec.write have been fixed.

"auron.partialAggSkipping.minRows")
.description("minimum number of rows to trigger partial aggregate skipping.")
.intType()
.defaultValue(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This modification will change the default behavior.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

removed this because it causes NPE in SparkAuronConfigurationTest, the UT doesn't initialize SparkEnv, and AuronAdaptor.getAuronConfiguration() depends on it to get SparkConf, triggering NPE.

public AuronConfiguration getAuronConfiguration() {
    return new SparkAuronConfiguration(SparkEnv$.MODULE$.get().conf()); // NPE in UT
}

I'll try to fix the UT first.
Do you have any better suggestions?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Tartarus0zm
this is a design flaw in SparkAuronConfiguration. Static initialization must not depend on runtime singletons like AuronAdaptor.getInstance().getAuronConfiguration(). because:

  • It creates a circular dependency:
    SparkAuronConfiguration → AuronAdaptor → SparkAuronConfiguration
  • Static defaults must be pure constants

In this PR, remove this code. Then we’ll open a follow-up issue to properly redesign SparkAuronConfiguration. What do you think?”

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

removed this because it causes NPE in SparkAuronConfigurationTest, the UT doesn't initialize SparkEnv, and AuronAdaptor.getAuronConfiguration() depends on it to get SparkConf, triggering NPE.

public AuronConfiguration getAuronConfiguration() {
    return new SparkAuronConfiguration(SparkEnv$.MODULE$.get().conf()); // NPE in UT
}

I'll try to fix the UT first. Do you have any better suggestions?

@yew1eb Thanks for your feedback.
It appears our current configuration design has some shortcomings. Typically, a configuration item's default value should not depend on another configuration item. Such cases are usually resolved through code. This is a legacy issue. I will resolve this issue in #1621 as soon as possible.

@yew1eb
Copy link
Contributor Author

yew1eb commented Nov 11, 2025

Thank you for your contribution. You can add a test to verify whether cases like NativeShuffleExchangeExec.write have been fixed.

@Tartarus0zm A direct test for NativeShuffleExchangeExec.write is ideal but hard to reproduce in unit tests — the local Spark cluster runs driver and executor in the same JVM, so static objects like AuronAdaptor.getInstance() are initialized early during driver planning, preventing the executor-side null → NPE failure seen in #1602.
We added test units to ensure getInstance() never returns null. I also deployed the fix to production and re-ran the same SQL — the NoClassDefFoundError no longer occurs.

Any suggestions for a stronger test? Happy to improve if there's a better approach!

@Tartarus0zm
Copy link
Contributor

Thank you for your contribution. You can add a test to verify whether cases like NativeShuffleExchangeExec.write have been fixed.

@Tartarus0zm A direct test for NativeShuffleExchangeExec.write is ideal but hard to reproduce in unit tests — the local Spark cluster runs driver and executor in the same JVM, so static objects like AuronAdaptor.getInstance() are initialized early during driver planning, preventing the executor-side null → NPE failure seen in #1602. We added test units to ensure getInstance() never returns null. I also deployed the fix to production and re-ran the same SQL — the NoClassDefFoundError no longer occurs.

Any suggestions for a stronger test? Happy to improve if there's a better approach!

@yew1eb Thank you for your clarification.
@merrily01 @richox Should we establish a CI environment with a driver and executor separated to detect such issues earlier?

@yew1eb
Copy link
Contributor Author

yew1eb commented Nov 13, 2025

Rebased this PR -- it depends on the #1621 fix. @richox @merrily01 , could you please review when you have time ? Thanks!

@yew1eb
Copy link
Contributor Author

yew1eb commented Nov 17, 2025

@richox @merrily01 please take a look and let me know if there are any issues.

Copy link
Contributor

@Tartarus0zm Tartarus0zm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@richox richox merged commit 58fa81f into apache:master Nov 18, 2025
98 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

NoClassDefFoundError: Could not initialize class org.apache.spark.sql.auron.NativeConverters$ during NativeShuffleExchange in Spark task

3 participants