[Fix][Zeta] Prevent NPE by lazy initializing overviewMap#10610
[Fix][Zeta] Prevent NPE by lazy initializing overviewMap#10610dybyte wants to merge 1 commit intoapache:devfrom
Conversation
Issue 1: Double-checked locking may have subtle visibility issuesLocation: private IMap<Long, CheckpointOverview> getOverviewMap() {
if (overviewMap == null) {
synchronized (this) {
if (overviewMap == null) {
overviewMap =
nodeEngine
.getHazelcastInstance()
.getMap(Constant.IMAP_CHECKPOINT_MONITOR);
}
}
}
return overviewMap;
}Related Context:
Problem Description: If Potential Risks:
Impact Scope:
Severity: MINOR Improvement Suggestions: // Suggestion: Ensure that the nodeEngine reference is also final or volatile
// The constructor already ensures that nodeEngine will not change, so the current code is safe
// But for code clarity, it is recommended to add comments
/**
* Lazily initializes the overviewMap to avoid potential NPE during
* early startup phase when Hazelcast services may not be fully ready.
* Uses double-checked locking for thread-safety.
*/
private IMap<Long, CheckpointOverview> getOverviewMap() {
if (overviewMap == null) {
synchronized (this) {
if (overviewMap == null) {
overviewMap = nodeEngine.getHazelcastInstance()
.getMap(Constant.IMAP_CHECKPOINT_MONITOR);
}
}
}
return overviewMap;
}Rationale:
Issue 2: Exception timing changes make debugging difficultLocation: // Before modification: constructor
public CheckpointMonitorService(NodeEngine nodeEngine, int maxHistorySize) {
this.overviewMap = nodeEngine.getHazelcastInstance()
.getMap(Constant.IMAP_CHECKPOINT_MONITOR); // Exception is thrown immediately here
this.maxHistorySize = maxHistorySize;
}
// After modification: constructor
public CheckpointMonitorService(NodeEngine nodeEngine, int maxHistorySize) {
this.nodeEngine = nodeEngine;
this.maxHistorySize = maxHistorySize;
// Exception is delayed until the first call to getOverviewMap()
}Related Context:
Problem Description:
Potential Risks:
Impact Scope:
Severity: MAJOR Improvement Suggestions: // Solution 1: Validate in the constructor that IMap can be obtained (but don't save the reference)
public CheckpointMonitorService(NodeEngine nodeEngine, int maxHistorySize) {
this.nodeEngine = nodeEngine;
this.maxHistorySize = maxHistorySize;
// Verify that the Hazelcast instance is ready, but don't save the reference to avoid potential timing issues
try {
nodeEngine.getHazelcastInstance()
.getMap(Constant.IMAP_CHECKPOINT_MONITOR);
} catch (Exception e) {
throw new IllegalStateException(
"Failed to access Hazelcast IMap during CheckpointMonitorService initialization", e);
}
}
// Solution 2: Lazy initialization with a pre-check method added
public void ensureInitialized() {
getOverviewMap();
}
// Called in SeaTunnelServer.startMaster()
private void startMaster() {
coordinatorService = new CoordinatorService(...);
checkpointService = new CheckpointService(...);
checkpointMonitorService = new CheckpointMonitorService(nodeEngine, 32);
checkpointMonitorService.ensureInitialized(); // Ensure successful initialization
// ...
}Rationale:
Issue 3: Missing logging for initialization timingLocation: private IMap<Long, CheckpointOverview> getOverviewMap() {
if (overviewMap == null) {
synchronized (this) {
if (overviewMap == null) {
// Missing logging
overviewMap = nodeEngine
.getHazelcastInstance()
.getMap(Constant.IMAP_CHECKPOINT_MONITOR);
}
}
}
return overviewMap;
}Related Context:
Problem Description:
This makes troubleshooting in production environments difficult. Potential Risks:
Impact Scope:
Severity: MINOR Improvement Suggestions: private IMap<Long, CheckpointOverview> getOverviewMap() {
if (overviewMap == null) {
synchronized (this) {
if (overviewMap == null) {
log.info("Initializing checkpoint monitor IMap: {}",
Constant.IMAP_CHECKPOINT_MONITOR);
long startTime = System.nanoTime();
try {
overviewMap = nodeEngine
.getHazelcastInstance()
.getMap(Constant.IMAP_CHECKPOINT_MONITOR);
long duration = System.nanoTime() - startTime;
log.debug("Checkpoint monitor IMap initialized in {} μs",
duration / 1000);
} catch (Exception e) {
log.error("Failed to initialize checkpoint monitor IMap", e);
throw e;
}
}
}
}
return overviewMap;
}Rationale:
Issue 4: Missing documentation explaining the reason for lazy initializationLocation: @Slf4j
public class CheckpointMonitorService {
private final NodeEngine nodeEngine;
private volatile IMap<Long, CheckpointOverview> overviewMap;
private final int maxHistorySize;
public CheckpointMonitorService(NodeEngine nodeEngine, int maxHistorySize) {
this.nodeEngine = nodeEngine;
this.maxHistorySize = maxHistorySize;
}
private IMap<Long, CheckpointOverview> getOverviewMap() {
// Missing JavaDoc explaining why lazy initialization is necessary
if (overviewMap == null) {
// ...
}
return overviewMap;
}Related Context:
Problem Description:
Potential Risks:
Impact Scope:
Severity: MINOR Improvement Suggestions: /**
* Service for monitoring checkpoint progress and statistics.
*
* <p>The underlying IMap is lazily initialized to avoid potential NPE
* during early startup phase when Hazelcast services may not be fully ready.
* This deferred initialization ensures the service can be instantiated
* before all distributed map structures are available.
*
* @see <a href="https://github.com/apache/seatunnel/issues/10570">Issue 10570</a>
*/
@Slf4j
public class CheckpointMonitorService {
/**
* Lazily initialized IMap for storing checkpoint overview data.
* Uses double-checked locking pattern for thread-safety.
*
* @return the initialized IMap
*/
private IMap<Long, CheckpointOverview> getOverviewMap() {
if (overviewMap == null) {
synchronized (this) {
if (overviewMap == null) {
overviewMap = nodeEngine
.getHazelcastInstance()
.getMap(Constant.IMAP_CHECKPOINT_MONITOR);
}
}
}
return overviewMap;
}Rationale:
Issue 5: Bug root cause not verified, fix may be overly defensiveLocation: Entire PR Problem Description:
Under normal Hazelcast semantics:
Potential Risks:
Impact Scope:
Severity: MAJOR (as a verification requirement before PR merge) Improvement Suggestions: Before merging, should: // 1. Add specific reproduction steps for Issue #10570 in the PR description
// 2. Add unit tests to prove the bug exists
@Test
public void testNPEWhenConstructorInitialized() {
// Simulate the scenario where Hazelcast is not fully initialized
// Verify that immediate initialization will fail
}
// 3. Or, if the bug cannot be verified, consider reverting this PR
// Until there is conclusive evidence that the problem existsRationale:
|
corgy-w
left a comment
There was a problem hiding this comment.
Overall LGTM.
But , is there a way to cover the test
I think this is caused by latency during IMap recovery from external storage, |
基本信息
PR 解决了什么问题(小白示例)
一、代码变更审查1.1 核心逻辑分析变更内容精确描述核心改动仅发生在 1 个文件:
修改前 / 修改后代码片段(关键)修改前( private final IMap<Long, CheckpointOverview> overviewMap;
public CheckpointMonitorService(NodeEngine nodeEngine, int maxHistorySize) {
this.overviewMap =
nodeEngine.getHazelcastInstance().getMap(Constant.IMAP_CHECKPOINT_MONITOR);
this.maxHistorySize = maxHistorySize;
}修改后( private final NodeEngine nodeEngine;
private volatile IMap<Long, CheckpointOverview> overviewMap;
public CheckpointMonitorService(NodeEngine nodeEngine, int maxHistorySize) {
this.nodeEngine = nodeEngine;
this.maxHistorySize = maxHistorySize;
}
private IMap<Long, CheckpointOverview> getOverviewMap() {
if (overviewMap == null) {
synchronized (this) {
if (overviewMap == null) {
overviewMap =
nodeEngine.getHazelcastInstance().getMap(Constant.IMAP_CHECKPOINT_MONITOR);
}
}
}
return overviewMap;
}另外, 关键发现(2-4 条高密度结论)
逻辑正确性深度分析(结合调用链)(1)NPE 触发点为何可能在构造阶段出现
旧实现里,构造函数立即执行:
所以“构造阶段 NPE”成立的必要条件是: (2)修改后何时生效、何时不会生效
完整运行链路(关键主链 + 触发点)问题 1:lazy init 仍可能在“首次访问过早”时 NPE,且缺少明确错误上下文(中)
问题 2:引入双重检查锁但缺少注释说明“为何必须这么做”(低)
问题 3:缺少回归测试覆盖该 NPE 场景(低)
1.2 兼容性影响
1.3 性能 / 副作用分析
1.4 错误处理与日志
二、代码质量评估2.1 代码规范
2.2 测试覆盖
本地验证结果说明:当前 Codex 沙盒为只读且无法联网,无法
2.3 文档更新
三、架构合理性3.1 解决方案的优雅性
3.2 可维护性
3.3 扩展性
3.4 历史版本兼容性
四、问题汇总
五、是否可以 Merge 的结论结论:可以 merge(但建议补强“可诊断性/可回归性”)阻塞项(必须修复)
建议修复(非阻塞)
整体评价
|
Fixes: #10570
Purpose of this pull request
Prevent potential NPE caused by early initialization of
overviewMap.overviewMapwas previously initialized in the constructor, which may be invoked beforePartitionServiceis fully ready. This change lazily initializes the map to ensure it is accessed only when needed.Does this PR introduce any user-facing change?
No.
How was this patch tested?
Covered by existing tests.
Check list
New License Guide
incompatible-changes.mdto describe the incompatibility caused by this PR.