
Conversation

@julienrbrt
Member

Closes: #2994

@gemini-code-assist
Contributor

Summary of Changes

Hello @julienrbrt, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request enhances the robustness of the SyncService's P2P initialization process. It introduces a mechanism to perform background retries for P2P synchronization if the initial attempt times out, preventing the service from getting stuck and ensuring eventual connectivity. This change improves the overall resilience and startup reliability of the synchronization component.

Highlights

  • Background Context for SyncService: Introduced bgCtx and bgCancel fields to the SyncService struct to manage background operations, ensuring proper cancellation when the service stops.
  • Refactored P2P Initialization Logic: Extracted the core logic for a single P2P initialization attempt into a new tryInit method, improving code modularity and reusability.
  • Background P2P Initialization Retries: Implemented a new retryInitInBackground goroutine that is launched if the initial P2P synchronization times out. This goroutine continuously attempts to initialize the syncer in the background with exponential backoff, preventing indefinite blocking of the main thread. A simplified sketch of this flow follows the list.
  • Reduced Initial P2P Sync Timeout: The initial P2P synchronization timeout in initFromP2PWithRetry has been reduced from 10 minutes to 5 minutes, allowing background retries to commence sooner if the initial attempt fails.
  • Graceful Shutdown of Background Operations: Added a call to bgCancel() in the Stop method of SyncService to ensure that any ongoing background initialization retries are properly terminated when the service is shut down.
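
A minimal, self-contained sketch of the flow described above, using simplified names (the real implementation lives in pkg/sync/sync_service.go as methods on SyncService and differs in detail):

package main

import (
    "context"
    "errors"
    "fmt"
    "time"
)

// tryInit stands in for SyncService.tryInit: a single attempt to fetch a
// trusted header from peers and initialize the store. Here it always fails.
func tryInit(ctx context.Context) (bool, error) {
    return false, errors.New("no peers available yet")
}

// initWithRetry mirrors initFromP2PWithRetry: retry with a short backoff until
// a timeout fires, then hand off to a background goroutine instead of blocking.
func initWithRetry(bgCtx context.Context, timeout time.Duration) {
    deadline := time.After(timeout)
    backoff := time.Second
    for {
        if ok, _ := tryInit(bgCtx); ok {
            return
        }
        select {
        case <-deadline:
            fmt.Println("timeout reached, continuing initialization in background")
            go retryInBackground(bgCtx)
            return
        case <-time.After(backoff):
            if backoff *= 2; backoff > 10*time.Second {
                backoff = 10 * time.Second
            }
        }
    }
}

// retryInBackground mirrors retryInitInBackground: keep retrying with
// exponential backoff until success or until bgCtx is cancelled by Stop().
func retryInBackground(bgCtx context.Context) {
    backoff := 15 * time.Second
    for {
        select {
        case <-bgCtx.Done():
            return
        case <-time.After(backoff):
        }
        if ok, _ := tryInit(bgCtx); ok {
            return
        }
        if backoff *= 2; backoff > 5*time.Minute {
            backoff = 5 * time.Minute
        }
    }
}

func main() {
    bgCtx, bgCancel := context.WithCancel(context.Background())
    defer bgCancel() // Stop() plays this role in the real service
    initWithRetry(bgCtx, 3*time.Second)
}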


Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page; here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

Feature              | Command             | Description
Code Review          | /gemini review      | Performs a code review for the current pull request in its current state.
Pull Request Summary | /gemini summary     | Provides a summary of the current pull request in its current state.
Comment              | @gemini-code-assist | Responds in comments when explicitly tagged, both in pull request comments and review comments.
Help                 | /gemini help        | Displays a list of available commands.

Customization

To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by adding files to a .gemini/ folder at the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any comments that are incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here.

You can also get AI-powered code generation, chat, and code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

@github-actions
Contributor

github-actions bot commented Jan 20, 2026

The latest Buf updates on your PR. Results from workflow CI / buf-check (pull_request).

Build     | Format     | Lint      | Breaking  | Updated (UTC)
✅ passed | ⏩ skipped | ✅ passed | ✅ passed | Jan 20, 2026, 1:12 PM

@claude
Contributor

claude bot commented Jan 20, 2026

Claude finished @julienrbrt's task in 2m 17s —— View job


Code Review Complete

I've performed a comprehensive review of the P2P initialization retry changes. Overall, the implementation is solid and addresses the timeout issue effectively. However, I've identified several important concerns across code quality, concurrency safety, testing, and error handling.


Critical Issues

1. Race Condition in Background Goroutine Lifecycle 🔴

Location: pkg/sync/sync_service.go:384

The background goroutine launched in initFromP2PWithRetry has no synchronization mechanism to track its lifecycle. This creates several problems:

  • No wait mechanism in Stop(): When Stop() is called, it cancels bgCtx but doesn't wait for the goroutine to finish. This can lead to goroutine leaks and race conditions during shutdown.
  • Potential for multiple background goroutines: If the service is started/stopped/started, you could end up with orphaned goroutines from previous runs.
  • Testing difficulties: No way to verify the goroutine completed in tests.

Recommendation: Add a sync.WaitGroup to track the background goroutine:

type SyncService[H header.Header[H]] struct {
    // ... existing fields ...
    bgCtx    context.Context
    bgCancel context.CancelFunc
    bgWg     sync.WaitGroup  // Add this
}

// In initFromP2PWithRetry:
case <-timeoutTimer.C:
    syncService.logger.Warn().Err(err).Msg("timeout reached while trying to initialize the store, scheduling background retry")
    
    syncService.bgWg.Add(1)
    go func() {
        defer syncService.bgWg.Done()
        syncService.retryInitInBackground()
    }()
    
    return nil

// In Stop():
func (syncService *SyncService[H]) Stop(ctx context.Context) error {
    // cancel background operations
    syncService.bgCancel()
    
    // Wait for background goroutine to finish
    syncService.bgWg.Wait()
    
    // ... rest of stop logic ...
}



2. Reduced Timeout Without Justification 🟡

Location: pkg/sync/sync_service.go:367

The initial timeout was reduced from 10 minutes to 2 minutes (in the latest fix commit). While shorter timeouts are generally good, this is a significant reduction (80%) that could cause issues in slow networks or during network partitions.

Questions:

  • Was this timeout tested in realistic network conditions?
  • What's the expected network latency for P2P peer discovery?
  • Could this cause unnecessary background retries in legitimate slow-start scenarios?

Recommendation: Either:

  1. Justify the 2-minute timeout with data/testing, or
  2. Make it configurable via config.Config (see the sketch after this list), or
  3. Use a more conservative value like 3-5 minutes
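
If option 2 is chosen, one possible shape for the knob, assuming a hypothetical InitialSyncTimeout field (neither the field nor the helper below exists in the current configuration):

package sync

import "time"

// Config is a trimmed-down stand-in for the sync configuration; the
// InitialSyncTimeout field is hypothetical and not part of config.Config today.
type Config struct {
    // InitialSyncTimeout bounds the blocking P2P initialization in Start()
    // before the service falls back to background retries. Zero means default.
    InitialSyncTimeout time.Duration
}

// initialSyncTimeout returns the configured timeout, defaulting to the value
// currently hardcoded in initFromP2PWithRetry.
func (c Config) initialSyncTimeout() time.Duration {
    if c.InitialSyncTimeout > 0 {
        return c.InitialSyncTimeout
    }
    return 2 * time.Minute
}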

High Priority Issues

3. Insufficient Test Coverage (8.16%) 🟡

Location: Overall change

Codecov reports only 8.16% coverage for this change (45 lines untested). The new functionality includes:

  • Background retry logic
  • Context cancellation
  • Exponential backoff in background
  • Timeout handling

Missing test scenarios:

  • Background retry eventually succeeding after initial timeout
  • Background retry being cancelled via Stop()
  • Multiple retry attempts with exponential backoff
  • Service restart while background retry is running
  • Concurrent calls to tryInit (race conditions)

Recommendation: Add integration tests covering:

func TestSyncService_BackgroundRetry_Success(t *testing.T) {
    // Test that background retry eventually succeeds
}

func TestSyncService_BackgroundRetry_Cancellation(t *testing.T) {
    // Test that Stop() properly cancels background retry
}

func TestSyncService_BackgroundRetry_ExponentialBackoff(t *testing.T) {
    // Test that backoff increases correctly
}



4. Potential Double Initialization Race 🟡

Location: pkg/sync/sync_service.go:341-346

The tryInit method uses CompareAndSwap to prevent double initialization, but there's a subtle race:

if syncService.storeInitialized.CompareAndSwap(false, true) {
    if _, err := syncService.initStore(ctx, trusted); err != nil {
        syncService.storeInitialized.Store(false)  // Reset on error
        return false, fmt.Errorf("failed to initialize the store: %w", err)
    }
}

If initStore fails, we reset the flag. But what if another goroutine (from WriteToStoreAndBroadcast) calls initStore concurrently? The same pattern exists there (lines 148-154).

Potential issue: between lines 341-346, two goroutines can interleave: if the first one's initStore fails and resets the flag, a second goroutine can immediately win the CompareAndSwap and begin its own initialization while the first is still unwinding, so both end up trying to append to the store.

Recommendation: Add a mutex to protect the entire initialization sequence, or ensure initStore itself is idempotent and can handle concurrent calls safely.
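
A minimal sketch of the mutex option on a stripped-down type (initGuard and ensureInitialized are illustrative names, not code from this PR):

package sync

import (
    "context"
    "fmt"
    "sync"
)

// initGuard serializes store initialization across every caller: the initial
// sync loop, the background retry, and WriteToStoreAndBroadcast.
type initGuard struct {
    mu          sync.Mutex
    initialized bool
    initStore   func(ctx context.Context) error
}

// ensureInitialized runs initStore at most once successfully; concurrent
// callers block on the mutex instead of racing on a check-then-init sequence.
func (g *initGuard) ensureInitialized(ctx context.Context) error {
    g.mu.Lock()
    defer g.mu.Unlock()
    if g.initialized {
        return nil
    }
    if err := g.initStore(ctx); err != nil {
        return fmt.Errorf("failed to initialize the store: %w", err)
    }
    g.initialized = true
    return nil
}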


5. Error Context Lost in Background Retry 🟡

Location: pkg/sync/sync_service.go:398-423

When the initial sync times out and switches to background mode, the last error is logged but not tracked. If background retry keeps failing, operators have no visibility into:

  • How many retry attempts have been made
  • Whether the same error keeps occurring
  • When the last attempt was made

Recommendation: Add metrics or structured logging:

func (syncService *SyncService[H]) retryInitInBackground() {
    backoff := 15 * time.Second
    maxBackoff := 5 * time.Minute
    attemptCount := 0

    for {
        // ... select statement ...
        
        attemptCount++
        ok, err := syncService.tryInit(syncService.bgCtx)
        if ok {
            syncService.logger.Info().
                Int("attempts", attemptCount).
                Msg("successfully initialized store from P2P in background")
            return
        }

        syncService.logger.Info().
            Err(err).
            Int("attempt", attemptCount).
            Dur("retry_in", backoff).
            Msg("background retry: headers not yet available from peers")
        
        // ... backoff logic ...
    }
}

Medium Priority Issues

6. Inconsistent Backoff Configuration 🟢

Location: pkg/sync/sync_service.go:364-365 and 399-400

Two different backoff configurations exist:

  • Initial retry: starts at 1s, max 10s
  • Background retry: starts at 15s, max 5m

Questions:

  • Why does background retry start higher (15s) than initial max (10s)?
  • Should these be configurable rather than hardcoded?
  • Is there a reason for the gap between 10s (initial max) and 15s (background start)?

Recommendation: Document the reasoning or consider smoother transition (e.g., background could start at 10s to continue where initial left off).
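
To make the handoff concrete, a small standalone example of the resulting schedule if the background loop continued from the 10s the initial loop ended on (illustrative only; backoffSchedule is not a function in the package):

package main

import (
    "fmt"
    "time"
)

// backoffSchedule lists the delays produced by doubling from start up to max.
func backoffSchedule(start, max time.Duration, steps int) []time.Duration {
    out := make([]time.Duration, 0, steps)
    d := start
    for i := 0; i < steps; i++ {
        out = append(out, d)
        if d *= 2; d > max {
            d = max
        }
    }
    return out
}

func main() {
    // Continuing from the initial loop's last delay instead of jumping to 15s:
    fmt.Println(backoffSchedule(10*time.Second, 5*time.Minute, 6))
    // Output: [10s 20s 40s 1m20s 2m40s 5m0s]
}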


7. Context Confusion: Service vs Background 🟢

Location: pkg/sync/sync_service.go:410

tryInit is called with syncService.bgCtx during background retry. However, bgCtx is created from context.Background() with only cancellation support. If the original Start() context had deadlines or values, they're lost.

Consideration: Is this intentional? Should background retries respect the original context's deadline, or should they truly run independently?

Current behavior: Background retries run until either:

  1. Success
  2. Stop() is called
  3. Process exits

This seems reasonable for resilience, but should be documented.
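
For reference, a tiny runnable illustration of the two lifecycles (startCtx stands in for the context passed to Start(); the real wiring lives inside SyncService):

package main

import (
    "context"
    "fmt"
    "time"
)

func main() {
    // The caller's context, e.g. the one passed to Start(), with a deadline.
    startCtx, cancel := context.WithTimeout(context.Background(), time.Second)
    defer cancel()

    // Detached lifecycle, as the PR does it: derived from context.Background(),
    // so the caller's deadline and values are intentionally dropped.
    bgCtx, bgCancel := context.WithCancel(context.Background())
    defer bgCancel() // Stop() plays this role in the service

    <-startCtx.Done()
    fmt.Println("start context:", startCtx.Err())    // context deadline exceeded
    fmt.Println("background context:", bgCtx.Err())  // <nil>: still running
}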


8. Order of Select Cases in Background Retry 🟢

Location: pkg/sync/sync_service.go:403-408

The latest fix commit moved the context cancellation check to happen before the backoff sleep. This is correct and prevents an unnecessary sleep before shutdown, but it's worth noting:

Before (incorrect):

// tryInit happens first
// Then select with cancellation/sleep

After (correct):

// Check cancellation first
// Then tryInit

This is good! But consider adding a comment explaining why the order matters.
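
One possible wording for such a comment, shown in a self-contained loop of the same shape (attempt and the durations are placeholders, not the PR's code):

package main

import (
    "context"
    "time"
)

func retryLoop(ctx context.Context, attempt func(context.Context) bool) {
    backoff := 15 * time.Second
    for {
        // Check cancellation before (and during) the backoff sleep: when the
        // context is cancelled by Stop() we return immediately instead of
        // finishing a possibly multi-minute sleep and running one more
        // pointless attempt.
        select {
        case <-ctx.Done():
            return
        case <-time.After(backoff):
        }
        if attempt(ctx) {
            return
        }
        if backoff *= 2; backoff > 5*time.Minute {
            backoff = 5 * time.Minute
        }
    }
}

func main() {
    ctx, cancel := context.WithCancel(context.Background())
    cancel() // with ctx already cancelled, retryLoop returns without sleeping
    retryLoop(ctx, func(context.Context) bool { return false })
}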


Minor Issues

9. Magic Numbers 🟢

Location: Multiple locations

Several hardcoded timeouts:

  • time.Minute * 2 (line 367)
  • 15 * time.Second (line 399)
  • 5 * time.Minute (line 400)

Recommendation: Extract to named constants for better maintainability:

const (
    initialSyncTimeout            = 2 * time.Minute
    backgroundRetryInitialBackoff = 15 * time.Second
    backgroundRetryMaxBackoff     = 5 * time.Minute
)

10. Documentation Completeness 🟢

Location: pkg/sync/sync_service.go:60-62

The new fields are documented:

// context for background operations
bgCtx    context.Context
bgCancel context.CancelFunc

But consider expanding to explain (a sketch of such a comment follows this list):

  • When background operations run (after initial timeout)
  • Why we need a separate context (independent lifecycle)
  • How they're cleaned up (via Stop())
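
For example, an expanded version of that comment might read (wording illustrative):

// bgCtx/bgCancel scope background work that must outlive Start(), currently
// the P2P initialization retries launched when the initial sync times out.
// bgCtx is deliberately derived from context.Background() so the retries have
// a lifecycle independent of the caller's deadline; they end only when Stop()
// calls bgCancel().
bgCtx    context.Context
bgCancel context.CancelFunc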

Positive Observations ✅

  1. Good refactoring: Extracting tryInit as a separate method improves testability and reusability.
  2. Proper exponential backoff: Both retry loops implement sensible backoff with maximums.
  3. Context-aware: Properly respects context cancellation in retry loops.
  4. Clean separation: Background retry is clearly separated from initial sync.
  5. Logging discipline: Good use of structured logging with error context and durations.
  6. No breaking changes: API surface remains unchanged; internal implementation improvement.

Architecture & Design Questions

11. Should Background Retry Be Observable? 🤔

Currently, there's no way to:

  • Query if background retry is running
  • Get status of background retry attempts
  • Manually trigger retry completion check

Consider: Should SyncService expose a method like GetSyncStatus() that returns:

type SyncStatus struct {
    StoreInitialized      bool
    SyncerStarted         bool
    BackgroundRetryActive bool
    LastRetryError        error
    RetryAttempts         int
}

This would help with monitoring and debugging.
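
A sketch of the plumbing behind such an accessor, using a hypothetical statusTracker helper (nothing like it exists in the current SyncService):

package sync

import "sync"

// SyncStatus mirrors the struct suggested above.
type SyncStatus struct {
    StoreInitialized      bool
    SyncerStarted         bool
    BackgroundRetryActive bool
    LastRetryError        error
    RetryAttempts         int
}

// statusTracker is a hypothetical helper that the initialization paths and the
// background retry loop would update as they make progress.
type statusTracker struct {
    mu     sync.RWMutex
    status SyncStatus
}

// Get returns a consistent snapshot for monitoring or debugging.
func (t *statusTracker) Get() SyncStatus {
    t.mu.RLock()
    defer t.mu.RUnlock()
    return t.status
}

// RecordAttempt is what the background retry loop would call after each tryInit.
func (t *statusTracker) RecordAttempt(err error) {
    t.mu.Lock()
    defer t.mu.Unlock()
    t.status.RetryAttempts++
    t.status.LastRetryError = err
}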


12. What Happens to Incoming Headers During Background Retry? 🤔

If the store isn't initialized and background retry is running, what happens when WriteToStoreAndBroadcast is called?

Looking at lines 148-155, it seems like it would:

  1. Try to initialize the store (might race with background retry)
  2. Start the syncer
  3. Succeed

This is probably fine, but worth documenting that background retry is a "best effort" and the store can also be initialized via incoming writes.


Security Considerations

No security issues identified. The changes don't:

  • Expose sensitive data in logs
  • Introduce authentication/authorization issues
  • Create new external-facing APIs
  • Handle user input

Performance Considerations

13. Reduced Initial Timeout Might Increase Load 🟡

Reducing timeout from 10m to 2m means:

  • Faster fallback to background retry (good for UX)
  • But potentially more frequent retry attempts across the fleet
  • If many nodes start simultaneously (e.g., after network partition), they'll all hit background retry mode sooner

Impact: Likely negligible, but worth monitoring P2P network load patterns after deployment.


Summary & Recommendations

Must Fix Before Merge 🔴

  1. Add WaitGroup to track background goroutine lifecycle - Prevents goroutine leaks and race conditions during shutdown

Should Fix Before Merge 🟡

  1. Add test coverage for background retry scenarios - Critical functionality with only 8.16% coverage
  2. Document or justify the 2-minute timeout reduction - Significant change from 10 minutes
  3. Review potential race in store initialization - Multiple code paths can initialize concurrently

Nice to Have 🟢

  1. Add retry attempt counter to background logging
  2. Extract magic numbers to named constants
  3. Document the intentional context separation (service vs background)
  4. Consider making timeouts configurable

Questions for Author

  • Was the 2-minute timeout tested in realistic network conditions with slow peers?
  • Is the potential race in store initialization acceptable, or should we add mutex protection?
  • Should we expose background retry status for monitoring/debugging?

Test Plan Suggestion

To properly validate this change, consider testing:

  1. Happy path: Initial sync succeeds within 2 minutes
  2. Timeout path: Initial sync times out, background retry succeeds
  3. Cancellation: Service stopped while background retry is running
  4. Network recovery: Peers become available after background retry starts
  5. Concurrent writes: Headers arrive via WriteToStoreAndBroadcast while background retry runs
  6. Restart scenarios: Service restarted before background retry completes

Overall, this is a valuable improvement that prevents the service from blocking indefinitely. The core logic is sound, but the implementation needs better synchronization and test coverage to be production-ready.


Contributor

@gemini-code-assist bot left a comment


Code Review

This pull request introduces a background retry mechanism for P2P initialization in SyncService. When the initial synchronous initialization times out, it now schedules a background task to continue retrying. This is a good improvement for startup performance as it avoids blocking indefinitely.

The changes are well-structured, including the addition of a background context for lifecycle management and the extraction of initialization logic into a reusable tryInit method. However, I've found a critical issue in the new retryInitInBackground function. It appears to be incomplete, missing the exponential backoff logic and the closing brace for the for loop, which would cause a compilation error. Please see my detailed comment for a suggested fix.

@codecov

codecov bot commented Jan 20, 2026

Codecov Report

❌ Patch coverage is 8.16327% with 45 lines in your changes missing coverage. Please review.
✅ Project coverage is 59.24%. Comparing base (140b24a) to head (cb212d6).

Files with missing lines | Patch % | Lines
pkg/sync/sync_service.go | 8.16%   | 45 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #3002      +/-   ##
==========================================
- Coverage   59.26%   59.24%   -0.03%     
==========================================
  Files         108      108              
  Lines       10103    10126      +23     
==========================================
+ Hits         5988     5999      +11     
- Misses       3484     3496      +12     
  Partials      631      631              
Flag     | Coverage Δ
combined | 59.24% <8.16%> (-0.03%) ⬇️

Flags with carried forward coverage won't be shown.

☔ View full report in Codecov by Sentry.



Development

Successfully merging this pull request may close these issues.

[BUG] Header-sync start-up timeout blocks entire node for 10 min
