Skip to content

[Feature][CDC/JDBC] Add option to disable sampling-based sharding#10604

Open
nzw921rx wants to merge 4 commits intoapache:devfrom
nzw921rx:feature/cdc-jdbc_add-disable-sampling
Open

[Feature][CDC/JDBC] Add option to disable sampling-based sharding#10604
nzw921rx wants to merge 4 commits intoapache:devfrom
nzw921rx:feature/cdc-jdbc_add-disable-sampling

Conversation

@nzw921rx
Copy link
Copy Markdown
Collaborator

@nzw921rx nzw921rx commented Mar 16, 2026

Closes #10596

Purpose of this pull request

Add a sample-sharding.enable option for CDC connectors and split.sample-sharding.enable for JDBC connector, allowing users to explicitly disable sampling-based sharding and fall back to iterative (unevenly-sized) chunk splitting.

Scope

This PR targets the shared JDBC/CDC splitter layer and wires the new options into the following factories:

  • JDBC: JdbcSourceFactory
  • CDC: MySQL, Oracle, PostgreSQL, SqlServer

Out of scope: OpenGauss CDC — its OpengaussIncrementalSourceFactory#optionRule() does not yet expose these options. This will be addressed in a separate follow-up PR.

Does this PR introduce any user-facing change?

New optional config:

  • JDBC: split.sample-sharding.enable (Boolean, default true)
  • CDC (MySQL / Oracle / PG / SqlServer): sample-sharding.enable (Boolean, default true)

When set to false, the splitter falls back to unevenly-sized chunk splitting (iterative query approach) regardless of the shard count.

How was this patch tested?

  • Unit test in DynamicChunkSplitterTest: verifies config parsing and default value
  • E2E test in JdbcMysqlSplitIT: verifies that enable=true uses sampling path while enable=false falls back to unevenly-sized chunk splitting, producing different split counts

Checklist

  • Documentation updated (en & zh) for JDBC, MySQL-CDC, Oracle-CDC, PostgreSQL-CDC, SqlServer-CDC
  • No incompatible changes (default true, fully backward compatible)
  • OpenGauss explicitly excluded from scope (follow-up PR)

@nzw921rx
Copy link
Copy Markdown
Collaborator Author

Hi @davidzollo ,

Feature completed

@DanielCarter-stack
Copy link
Copy Markdown

Issue 1: Typo

Location:

  • seatunnel-connectors-v2/connector-cdc/connector-cdc-base/src/main/java/org/apache/seatunnel/connectors/cdc/base/option/JdbcSourceOptions.java:153
  • seatunnel-connectors-v2/connector-jdbc/src/main/java/org/apache/seatunnel/connectors/seatunnel/jdbc/config/JdbcSourceOptions.java:106

Issue Description:
There is a typo in the configuration description: "The default value is ture." should be "The default value is true."

Potential Risks:

  • Users will see the typo when reading the documentation, affecting professionalism
  • May lead users to question code quality

Impact Scope:

  • All users using CDC and JDBC connectors

Severity: MINOR

Improvement Suggestion:

.withDescription(
    "Whether to enable sampling-based sharding strategy."
        + "When set false, the system should fall back to unevenly-sized chunk splitting (iterative query approach) regardless of the shard count."
        + "The default value is true.");  // Fix typo

Issue 2: Missing E2E tests for CDC

Location:

  • All CDC connectors (MySQL, Oracle, PostgreSQL, SQLServer)

Issue Description:
The PR only added E2E tests for JDBC connectors, but did not add corresponding E2E tests for CDC connectors. Although the code logic is the same as JDBC, CDC has different execution paths, and independent E2E tests would provide better assurance.

Potential Risks:

  • Changes to CDC connectors may have undiscovered integration issues
  • If CDC and JDBC logic diverge in the future, it may not be detected in time

Impact Scope:

  • All CDC connector users

Severity: MINOR

Improvement Suggestion:
It is recommended to add similar E2E tests for CDC connectors, for example by adding test methods in each CDC's E2E test class. However, this is not required because:

  1. Unit tests have already covered the core logic
  2. JDBC E2E tests have verified functional correctness
  3. CDC and JDBC use the same underlying code

Issue 3: Missing unit tests for CDC connectors

Location:

  • Configuration tests for all CDC connectors

Issue Description:
The PR only added unit tests for JDBC connectorstestSampleShardingEnableConfigParsing, and did not add similar configuration parsing tests for CDC connectors.

Potential Risks:

  • Configuration parsing for CDC connectors may have undiscovered bugs
  • Constructor parameter passing for each CDC connector may have issues

Impact Scope:

  • All CDC connectors

Severity: MINOR

Improvement Suggestion:
Although not required (since the configuration parsing logic is in the base class), consider adding similar configuration parsing tests in each CDC connector's test class to ensure completeness.

Copy link
Copy Markdown
Collaborator Author

@nzw921rx nzw921rx left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please review @davidzollo @chl-wxp

@nzw921rx nzw921rx force-pushed the feature/cdc-jdbc_add-disable-sampling branch from 39732c5 to 415c3a5 Compare April 5, 2026 15:38
@nzw921rx nzw921rx self-assigned this Apr 5, 2026
Copy link
Copy Markdown

@DanielLeens DanielLeens left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I walked the real runtime path locally before reviewing this: FactoryUtil.createAndPrepareSource -> CDC createSourceConfigFactory / JDBC JdbcSourceConfig.of -> SnapshotSplitAssigner / DynamicChunkSplitter -> the final if (sampleShardingEnable && sampleShardingThreshold < shardCount) gate. The overall direction is correct, and the new flag is indeed honored on the real snapshot splitting path.

I still have one blocking concern before merge:

  1. The new CDC option is wired into runtime parsing, but it is not added to the CDC factories' optionRule() metadata. For example:

    • connector-cdc-mysql/.../MySqlIncrementalSourceFactory.java:56-80
    • connector-cdc-postgres/.../PostgresIncrementalSourceFactory.java:61-74
    • connector-cdc-oracle/.../OracleIncrementalSourceFactory.java:63-80
    • connector-cdc-sqlserver/.../SqlServerIncrementalSourceFactory.java:61-73

    JdbcSourceConfigFactory.fromReadonlyConfig() already reads SAMPLE_SHARDING_ENABLE, and AbstractJdbcSourceChunkSplitter already uses it, so the execution path is connected. But FactoryUtil, connector-check, and the engine option-rule REST metadata all consume factory.optionRule(). Right now the docs advertise the new CDC knob, while the connector metadata still hides it. Please add SAMPLE_SHARDING_ENABLE to the intended CDC factories before merge.

One non-blocking follow-up:

  1. The new tests only cover JDBC. CDC uses a different splitter implementation (AbstractJdbcSourceChunkSplitter), so I would strongly prefer at least one CDC-side regression test for the sample-sharding.enable=false fallback path.

Copy link
Copy Markdown

@DanielLeens DanielLeens left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I pulled this locally and traced both the JDBC path and the CDC path from config intake into the actual splitters.

The core runtime change is wired correctly:

  • JDBC now reads split.sample-sharding.enable into JdbcSourceConfig and uses it in DynamicChunkSplitter
  • CDC now carries sample-sharding.enable through JdbcSourceConfigFactory / BaseSourceConfig into AbstractJdbcSourceChunkSplitter

So the real sampling decision point is covered.

I still found one contract gap that should be fixed before merge:

  1. The new CDC option is not exposed in the JDBC-based CDC source factories’ optionRule() definitions.
    I checked the current heads for MySQL / PostgreSQL / Oracle / SQLServer / OpenGauss, and their factories still list SAMPLE_SHARDING_THRESHOLD (and in some cases INVERSE_SAMPLING_RATE), but not SAMPLE_SHARDING_ENABLE itself.

That means the runtime can read the option, and the docs now advertise it, but the official connector option contract is still incomplete for those CDC connectors.

I recommend adding SAMPLE_SHARDING_ENABLE to each JDBC-based CDC source factory optionRule().optional(...) list, and then aligning the docs scope as needed (for example OpenGauss, if it is intended to inherit the same capability).

Once that public contract layer is aligned with the runtime wiring, I’m happy to take another look.

Copy link
Copy Markdown

@DanielLeens DanielLeens left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I fetched the branch locally, ran DynamicChunkSplitterTest, and traced the main split path in AbstractJdbcSourceChunkSplitter / DynamicChunkSplitter.

The new sample-sharding.enable flag is wired through the config objects into the splitter decision path, and I did not spot a blocker in the normal enumeration flow. The direction looks good to me.

Residual risk: the change mainly affects the uneven-distribution planning path, so an extra regression case with sample-sharding.enable = false on a skewed table would make the intent even clearer.

@nzw921rx nzw921rx requested a review from DanielLeens April 9, 2026 14:36
@nzw921rx
Copy link
Copy Markdown
Collaborator Author

nzw921rx commented Apr 9, 2026

I pulled this locally and traced both the JDBC path and the CDC path from config intake into the actual splitters.

The core runtime change is wired correctly:

  • JDBC now reads split.sample-sharding.enable into JdbcSourceConfig and uses it in DynamicChunkSplitter
  • CDC now carries sample-sharding.enable through JdbcSourceConfigFactory / BaseSourceConfig into AbstractJdbcSourceChunkSplitter

So the real sampling decision point is covered.

I still found one contract gap that should be fixed before merge:

  1. The new CDC option is not exposed in the JDBC-based CDC source factories’ optionRule() definitions.
    I checked the current heads for MySQL / PostgreSQL / Oracle / SQLServer / OpenGauss, and their factories still list SAMPLE_SHARDING_THRESHOLD (and in some cases INVERSE_SAMPLING_RATE), but not SAMPLE_SHARDING_ENABLE itself.

That means the runtime can read the option, and the docs now advertise it, but the official connector option contract is still incomplete for those CDC connectors.

I recommend adding SAMPLE_SHARDING_ENABLE to each JDBC-based CDC source factory optionRule().optional(...) list, and then aligning the docs scope as needed (for example OpenGauss, if it is intended to inherit the same capability).

Once that public contract layer is aligned with the runtime wiring, I’m happy to take another look.

done @DanielLeens please help me Review?

Copy link
Copy Markdown

@DanielLeens DanielLeens left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the update. I pulled the latest branch locally and there is still one blocking completeness gap in the new option surface.

The PR adds split.sample-sharding.enable / split.inverse-sampling.rate across the JDBC / CDC splitter path, but OpengaussIncrementalSourceFactory#optionRule() still only exposes split.sample-sharding.threshold in its optional set. The runtime may still read unknown keys from config, but the public factory contract / validation surface for OpenGauss is now inconsistent with the feature that this PR advertises for CDC/JDBC.

Please either wire the new options into the OpenGauss factory as well, or narrow the documented / supported scope of this PR.

Copy link
Copy Markdown

@DanielLeens DanielLeens left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I re-pulled the current HEAD locally and rechecked the actual option flow from the public connector contract into the splitter path.

Current runtime chain:
FactoryUtil / optionRule -> JdbcSourceConfigFactory -> BaseSourceConfig / JdbcSourceConfig -> AbstractJdbcSourceChunkSplitter or DynamicChunkSplitter

The main blocker still remains on the public contract surface:

  • connector-cdc-opengauss/.../OpengaussIncrementalSourceFactory.java:57-81 still exposes only SAMPLE_SHARDING_THRESHOLD in optionRule()
  • while JDBC-based CDC factories such as connector-cdc-mysql/.../MySqlIncrementalSourceFactory.java:72-81 already publish SAMPLE_SHARDING_ENABLE (and the related new options)

So the execution path may be wired in shared code, but the user-visible / validation / metadata contract is still incomplete for OpenGauss. For a PR that advertises the new CDC/JDBC sampling controls, that is still a merge blocker from my side.

Please either:

  1. wire the OpenGauss factory option surface to the same feature set, or
  2. explicitly narrow the documented and supported scope of this PR.

Separately, the current Build check is still red.

@nzw921rx
Copy link
Copy Markdown
Collaborator Author

I re-pulled the current HEAD locally and rechecked the actual option flow from the public connector contract into the splitter path.

Current runtime chain: FactoryUtil / optionRule -> JdbcSourceConfigFactory -> BaseSourceConfig / JdbcSourceConfig -> AbstractJdbcSourceChunkSplitter or DynamicChunkSplitter

The main blocker still remains on the public contract surface:

  • connector-cdc-opengauss/.../OpengaussIncrementalSourceFactory.java:57-81 still exposes only SAMPLE_SHARDING_THRESHOLD in optionRule()
  • while JDBC-based CDC factories such as connector-cdc-mysql/.../MySqlIncrementalSourceFactory.java:72-81 already publish SAMPLE_SHARDING_ENABLE (and the related new options)

So the execution path may be wired in shared code, but the user-visible / validation / metadata contract is still incomplete for OpenGauss. For a PR that advertises the new CDC/JDBC sampling controls, that is still a merge blocker from my side.

Please either:

  1. wire the OpenGauss factory option surface to the same feature set, or
  2. explicitly narrow the documented and supported scope of this PR.

Separately, the current Build check is still red.

thanks,

Revise the PR description to use Plan 2 and explicitly narrow down the PR scope to exclude OpenGauss

Copy link
Copy Markdown

@DanielLeens DanielLeens left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I re-pulled the latest HEAD locally and rechecked the public option surface from the connector contract into the splitter path.

Current runtime chain:
Factory optionRule -> JdbcSourceConfigFactory / JdbcSourceConfig -> shared splitter (AbstractJdbcSourceChunkSplitter / DynamicChunkSplitter)

Compared with the revision I previously blocked:

  • the in-scope CDC factories on the current branch now expose SAMPLE_SHARDING_ENABLE in their option surface (MySQL / Oracle / PostgreSQL / SQLServer)
  • the PR body now explicitly narrows scope and calls out OpenGauss as out of scope for a follow-up PR

So the blocker from my previous review is addressed in the current revision.

I still see one non-code item to watch: the current Build check is red, and the failed item is the unit-test shard Run / unit-test (11, ubuntu-latest) on the contributor workflow. I could identify the failing shard from the check metadata, but I did not find a new code blocker in the changed files themselves.

zhiwei.niu and others added 3 commits April 13, 2026 12:49
…sampling

Rename the config option per reviewer suggestion:
- CDC: sample-sharding.enable -> split.allow-sampling
- JDBC: split.sample-sharding.enable -> split.allow-sampling

Made-with: Cursor
@nzw921rx nzw921rx force-pushed the feature/cdc-jdbc_add-disable-sampling branch from 4f93322 to ad533ef Compare April 13, 2026 04:50
Copy link
Copy Markdown

@DanielLeens DanielLeens left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I re-pulled the latest HEAD locally and rechecked the current JDBC / CDC split-planning path on the new revision.

Current runtime chain:
factory optionRule
-> JdbcSourceConfig / JdbcSourceConfigFactory
-> shared splitter gate in DynamicChunkSplitter and AbstractJdbcSourceChunkSplitter
-> if (sampleShardingEnable && sampleShardingThreshold < shardCount) decide sampling vs iterative uneven splitting.

On the current HEAD, the latest delta is mainly the user-facing rename to split.allow-sampling, and that rename is wired consistently through the public option surface and runtime path:

  • JDBC option definition: connector-jdbc/.../JdbcSourceOptions.java:99-106
  • CDC option definition: connector-cdc-base/.../JdbcSourceOptions.java:146-153
  • JDBC splitter gate: DynamicChunkSplitter.java:299-317
  • CDC splitter gate: AbstractJdbcSourceChunkSplitter.java:167-190

I also rechecked the contract layer that I previously blocked. The in-scope CDC factories still expose the new option surface on the current HEAD, while OpenGauss remains explicitly narrowed out of scope for a follow-up.

Conclusion: mergeable from a code-review perspective.
Blocking items: none in the current diff.
Suggested follow-up: none from me. The remaining item is the queued Build check.

Copy link
Copy Markdown

@DanielLeens DanielLeens left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I pulled the latest head locally again and checked the delta since my previous approval.

Current runtime chain:

factory optionRule
  -> JdbcSourceConfig / JdbcSourceConfigFactory
  -> shared splitter gate in DynamicChunkSplitter and AbstractJdbcSourceChunkSplitter
  -> if (sampleShardingEnable && sampleShardingThreshold < shardCount)
       sampling path
     else
       iterative uneven-split path

The newest commit after my previous review is formatting-only in JdbcSourceConfigFactory.java; it does not change the public option contract or the sampling-gate semantics. So my earlier code-level conclusion still stands: I do not see a new blocker in the current diff.

The remaining blocker is CI, not the code path itself: the current Build check is red. Please rerun or fix that workflow before merge.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Feature][CDC/JDBC] Add option to disable sampling-based sharding

5 participants