[Feature][CDC/JDBC] Add option to disable sampling-based sharding#10604
[Feature][CDC/JDBC] Add option to disable sampling-based sharding#10604nzw921rx wants to merge 4 commits intoapache:devfrom
Conversation
|
Hi @davidzollo , Feature completed |
Issue 1: TypoLocation:
Issue Description: Potential Risks:
Impact Scope:
Severity: MINOR Improvement Suggestion: .withDescription(
"Whether to enable sampling-based sharding strategy."
+ "When set false, the system should fall back to unevenly-sized chunk splitting (iterative query approach) regardless of the shard count."
+ "The default value is true."); // Fix typoIssue 2: Missing E2E tests for CDCLocation:
Issue Description: Potential Risks:
Impact Scope:
Severity: MINOR Improvement Suggestion:
Issue 3: Missing unit tests for CDC connectorsLocation:
Issue Description: Potential Risks:
Impact Scope:
Severity: MINOR Improvement Suggestion: |
39732c5 to
415c3a5
Compare
DanielLeens
left a comment
There was a problem hiding this comment.
I walked the real runtime path locally before reviewing this: FactoryUtil.createAndPrepareSource -> CDC createSourceConfigFactory / JDBC JdbcSourceConfig.of -> SnapshotSplitAssigner / DynamicChunkSplitter -> the final if (sampleShardingEnable && sampleShardingThreshold < shardCount) gate. The overall direction is correct, and the new flag is indeed honored on the real snapshot splitting path.
I still have one blocking concern before merge:
-
The new CDC option is wired into runtime parsing, but it is not added to the CDC factories'
optionRule()metadata. For example:connector-cdc-mysql/.../MySqlIncrementalSourceFactory.java:56-80connector-cdc-postgres/.../PostgresIncrementalSourceFactory.java:61-74connector-cdc-oracle/.../OracleIncrementalSourceFactory.java:63-80connector-cdc-sqlserver/.../SqlServerIncrementalSourceFactory.java:61-73
JdbcSourceConfigFactory.fromReadonlyConfig()already readsSAMPLE_SHARDING_ENABLE, andAbstractJdbcSourceChunkSplitteralready uses it, so the execution path is connected. ButFactoryUtil,connector-check, and the engine option-rule REST metadata all consumefactory.optionRule(). Right now the docs advertise the new CDC knob, while the connector metadata still hides it. Please addSAMPLE_SHARDING_ENABLEto the intended CDC factories before merge.
One non-blocking follow-up:
- The new tests only cover JDBC. CDC uses a different splitter implementation (
AbstractJdbcSourceChunkSplitter), so I would strongly prefer at least one CDC-side regression test for thesample-sharding.enable=falsefallback path.
DanielLeens
left a comment
There was a problem hiding this comment.
I pulled this locally and traced both the JDBC path and the CDC path from config intake into the actual splitters.
The core runtime change is wired correctly:
- JDBC now reads
split.sample-sharding.enableintoJdbcSourceConfigand uses it inDynamicChunkSplitter - CDC now carries
sample-sharding.enablethroughJdbcSourceConfigFactory/BaseSourceConfigintoAbstractJdbcSourceChunkSplitter
So the real sampling decision point is covered.
I still found one contract gap that should be fixed before merge:
- The new CDC option is not exposed in the JDBC-based CDC source factories’
optionRule()definitions.
I checked the current heads for MySQL / PostgreSQL / Oracle / SQLServer / OpenGauss, and their factories still listSAMPLE_SHARDING_THRESHOLD(and in some casesINVERSE_SAMPLING_RATE), but notSAMPLE_SHARDING_ENABLEitself.
That means the runtime can read the option, and the docs now advertise it, but the official connector option contract is still incomplete for those CDC connectors.
I recommend adding SAMPLE_SHARDING_ENABLE to each JDBC-based CDC source factory optionRule().optional(...) list, and then aligning the docs scope as needed (for example OpenGauss, if it is intended to inherit the same capability).
Once that public contract layer is aligned with the runtime wiring, I’m happy to take another look.
DanielLeens
left a comment
There was a problem hiding this comment.
I fetched the branch locally, ran DynamicChunkSplitterTest, and traced the main split path in AbstractJdbcSourceChunkSplitter / DynamicChunkSplitter.
The new sample-sharding.enable flag is wired through the config objects into the splitter decision path, and I did not spot a blocker in the normal enumeration flow. The direction looks good to me.
Residual risk: the change mainly affects the uneven-distribution planning path, so an extra regression case with sample-sharding.enable = false on a skewed table would make the intent even clearer.
done @DanielLeens please help me Review? |
DanielLeens
left a comment
There was a problem hiding this comment.
Thanks for the update. I pulled the latest branch locally and there is still one blocking completeness gap in the new option surface.
The PR adds split.sample-sharding.enable / split.inverse-sampling.rate across the JDBC / CDC splitter path, but OpengaussIncrementalSourceFactory#optionRule() still only exposes split.sample-sharding.threshold in its optional set. The runtime may still read unknown keys from config, but the public factory contract / validation surface for OpenGauss is now inconsistent with the feature that this PR advertises for CDC/JDBC.
Please either wire the new options into the OpenGauss factory as well, or narrow the documented / supported scope of this PR.
DanielLeens
left a comment
There was a problem hiding this comment.
I re-pulled the current HEAD locally and rechecked the actual option flow from the public connector contract into the splitter path.
Current runtime chain:
FactoryUtil / optionRule -> JdbcSourceConfigFactory -> BaseSourceConfig / JdbcSourceConfig -> AbstractJdbcSourceChunkSplitter or DynamicChunkSplitter
The main blocker still remains on the public contract surface:
connector-cdc-opengauss/.../OpengaussIncrementalSourceFactory.java:57-81still exposes onlySAMPLE_SHARDING_THRESHOLDinoptionRule()- while JDBC-based CDC factories such as
connector-cdc-mysql/.../MySqlIncrementalSourceFactory.java:72-81already publishSAMPLE_SHARDING_ENABLE(and the related new options)
So the execution path may be wired in shared code, but the user-visible / validation / metadata contract is still incomplete for OpenGauss. For a PR that advertises the new CDC/JDBC sampling controls, that is still a merge blocker from my side.
Please either:
- wire the OpenGauss factory option surface to the same feature set, or
- explicitly narrow the documented and supported scope of this PR.
Separately, the current Build check is still red.
thanks, Revise the PR description to use Plan 2 and explicitly narrow down the PR scope to exclude OpenGauss |
DanielLeens
left a comment
There was a problem hiding this comment.
I re-pulled the latest HEAD locally and rechecked the public option surface from the connector contract into the splitter path.
Current runtime chain:
Factory optionRule -> JdbcSourceConfigFactory / JdbcSourceConfig -> shared splitter (AbstractJdbcSourceChunkSplitter / DynamicChunkSplitter)
Compared with the revision I previously blocked:
- the in-scope CDC factories on the current branch now expose
SAMPLE_SHARDING_ENABLEin their option surface (MySQL / Oracle / PostgreSQL / SQLServer) - the PR body now explicitly narrows scope and calls out OpenGauss as out of scope for a follow-up PR
So the blocker from my previous review is addressed in the current revision.
I still see one non-code item to watch: the current Build check is red, and the failed item is the unit-test shard Run / unit-test (11, ubuntu-latest) on the contributor workflow. I could identify the failing shard from the check metadata, but I did not find a new code blocker in the changed files themselves.
…ATE to CDC factory optionRule()
…sampling Rename the config option per reviewer suggestion: - CDC: sample-sharding.enable -> split.allow-sampling - JDBC: split.sample-sharding.enable -> split.allow-sampling Made-with: Cursor
4f93322 to
ad533ef
Compare
DanielLeens
left a comment
There was a problem hiding this comment.
I re-pulled the latest HEAD locally and rechecked the current JDBC / CDC split-planning path on the new revision.
Current runtime chain:
factory optionRule
-> JdbcSourceConfig / JdbcSourceConfigFactory
-> shared splitter gate in DynamicChunkSplitter and AbstractJdbcSourceChunkSplitter
-> if (sampleShardingEnable && sampleShardingThreshold < shardCount) decide sampling vs iterative uneven splitting.
On the current HEAD, the latest delta is mainly the user-facing rename to split.allow-sampling, and that rename is wired consistently through the public option surface and runtime path:
- JDBC option definition:
connector-jdbc/.../JdbcSourceOptions.java:99-106 - CDC option definition:
connector-cdc-base/.../JdbcSourceOptions.java:146-153 - JDBC splitter gate:
DynamicChunkSplitter.java:299-317 - CDC splitter gate:
AbstractJdbcSourceChunkSplitter.java:167-190
I also rechecked the contract layer that I previously blocked. The in-scope CDC factories still expose the new option surface on the current HEAD, while OpenGauss remains explicitly narrowed out of scope for a follow-up.
Conclusion: mergeable from a code-review perspective.
Blocking items: none in the current diff.
Suggested follow-up: none from me. The remaining item is the queued Build check.
DanielLeens
left a comment
There was a problem hiding this comment.
I pulled the latest head locally again and checked the delta since my previous approval.
Current runtime chain:
factory optionRule
-> JdbcSourceConfig / JdbcSourceConfigFactory
-> shared splitter gate in DynamicChunkSplitter and AbstractJdbcSourceChunkSplitter
-> if (sampleShardingEnable && sampleShardingThreshold < shardCount)
sampling path
else
iterative uneven-split path
The newest commit after my previous review is formatting-only in JdbcSourceConfigFactory.java; it does not change the public option contract or the sampling-gate semantics. So my earlier code-level conclusion still stands: I do not see a new blocker in the current diff.
The remaining blocker is CI, not the code path itself: the current Build check is red. Please rerun or fix that workflow before merge.
Closes #10596
Purpose of this pull request
Add a
sample-sharding.enableoption for CDC connectors andsplit.sample-sharding.enablefor JDBC connector, allowing users to explicitly disable sampling-based sharding and fall back to iterative (unevenly-sized) chunk splitting.Scope
This PR targets the shared JDBC/CDC splitter layer and wires the new options into the following factories:
JdbcSourceFactoryOut of scope: OpenGauss CDC — its
OpengaussIncrementalSourceFactory#optionRule()does not yet expose these options. This will be addressed in a separate follow-up PR.Does this PR introduce any user-facing change?
New optional config:
split.sample-sharding.enable(Boolean, defaulttrue)sample-sharding.enable(Boolean, defaulttrue)When set to
false, the splitter falls back to unevenly-sized chunk splitting (iterative query approach) regardless of the shard count.How was this patch tested?
DynamicChunkSplitterTest: verifies config parsing and default valueJdbcMysqlSplitIT: verifies thatenable=trueuses sampling path whileenable=falsefalls back to unevenly-sized chunk splitting, producing different split countsChecklist
true, fully backward compatible)