Skip to content

fix: race condition between Commit and CreateQueryContext#24508

Closed
beer-1 wants to merge 7 commits into
cosmos:mainfrom
initia-labs:fix/race
Closed

fix: race condition between Commit and CreateQueryContext#24508
beer-1 wants to merge 7 commits into
cosmos:mainfrom
initia-labs:fix/race

Conversation

@beer-1
Copy link
Copy Markdown
Contributor

@beer-1 beer-1 commented Apr 14, 2025

Description

Replace #24392


Author Checklist

All items are required. Please add a note to the item if the item is not applicable and
please add links to any relevant follow up issues.

I have...

  • included the correct type prefix in the PR title, you can find examples of the prefixes below:
  • confirmed ! in the type prefix if API or client breaking change
  • targeted the correct branch (see PR Targeting)
  • provided a link to the relevant issue or specification
  • reviewed "Files changed" and left comments if necessary
  • included the necessary unit and integration tests
  • added a changelog entry to CHANGELOG.md
  • updated the relevant documentation or specification, including comments for documenting Go code
  • confirmed all CI checks have passed

Reviewers Checklist

All items are required. Please add a note if the item is not applicable and please add
your handle next to the items reviewed if you only reviewed selected items.

Please see Pull Request Reviewer section in the contributing guide for more information on how to review a pull request.

I have...

  • confirmed the correct type prefix in the PR title
  • confirmed all author checklist items have been addressed
  • reviewed state machine logic, API design and naming, documentation is accurate, tests and test coverage

Summary by CodeRabbit

  • New Features
    • Enhanced synchronization measures improve stability and consistency during concurrent operations.
  • Refactor
    • Streamlined state management and context handling, ensuring smoother block and query processing.
  • Tests
    • Added comprehensive race condition testing to validate robust performance under intensive concurrent operations.

@ironbird-prod
Copy link
Copy Markdown

ironbird-prod Bot commented Apr 14, 2025

Ironbird - launch a network To use Ironbird, you can use the following commands:
  • /ironbird start OR /ironbird start --load-test-config= - Launch a testnet with the specified chain and load test configuration.
  • /ironbird chains - List of chain images that ironbird can use to spin-up testnet
  • /ironbird loadtests - List of load test modes that ironbird can run against testnet
Custom Load Test Configuration You can provide a custom load test configuration using the `--load-test-config=` flag:
/ironbird start cosmos --load-test-config={
  "block_gas_limit_target": 0.75,
  "num_of_blocks": 50,
  "msgs": [
    {"weight": 0.3, "type": "MsgSend"},
    {"weight": 0.3, "type": "MsgMultiSend"},
	{"weight": 0.4, "type": "MsgArr", "ContainedType": "MsgSend", "NumMsgs": 3300}
  ]
}

Use /ironbird loadtests to see more examples.

@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented Apr 14, 2025

📝 Walkthrough

Walkthrough

The changes enhance thread safety in the baseapp module by introducing proper synchronization for state management and commit operations. The modifications refactor state access by replacing direct field references with method calls (e.g., getState and clearState) that employ a new read/write mutex. Additionally, the commit-related data structure is updated to use atomic operations, and a new test is included to simulate concurrent queries and commits. These adjustments ensure that shared resources, such as the application state and commit information, are accessed and modified safely.

Changes

File(s) Change Summary
CHANGELOG.md Adds a new entry detailing the introduction of mutex locks for the state and atomic operations for lastCommitInfo in the baseapp module, addressing race conditions between Commit and CreateQueryContext.
baseapp/abci.go, baseapp/baseapp.go, baseapp/test_helpers.go Refactors state management by replacing direct access with calls to getState and clearState, and adds a sync.RWMutex (stateMut) to ensure thread-safe state modifications and retrievals.
baseapp/abci_test.go Introduces a new test function TestABCI_Race_Commit_Query that verifies concurrent commit and query operations, ensuring the proper handling of race conditions during block finalization and state updates.
store/rootmulti/store.go Updates the lastCommitInfo field type to atomic.Pointer[types.CommitInfo] and adjusts related methods to perform atomic operations, enhancing concurrency control in the commit process.

Sequence Diagram(s)

sequenceDiagram
    autonumber
    participant Client as Client
    participant BA as BaseApp
    participant SM as State Mutex
    participant Store as Store (Atomic)

    %% Commit Flow
    Client->>BA: Issue Commit Request
    BA->>SM: Acquire Write Lock (setState/clearState)
    BA->>BA: Perform state update operations
    BA->>Store: Atomically update lastCommitInfo
    BA->>SM: Release Write Lock
    BA->>Client: Return Commit Response

    %% Query Flow
    Client->>BA: Issue Query Request
    BA->>SM: Acquire Read Lock (via getState)
    BA->>BA: Retrieve safe state context
    BA->>SM: Release Read Lock
    BA->>Client: Return Query Context
Loading

📜 Recent review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 7e5e8dc and b16d0df.

⛔ Files ignored due to path filters (2)
  • go.mod is excluded by !**/*.mod
  • go.sum is excluded by !**/*.sum, !**/*.sum
📒 Files selected for processing (6)
  • CHANGELOG.md (1 hunks)
  • baseapp/abci.go (27 hunks)
  • baseapp/abci_test.go (2 hunks)
  • baseapp/baseapp.go (8 hunks)
  • baseapp/test_helpers.go (1 hunks)
  • store/rootmulti/store.go (7 hunks)
🧰 Additional context used
🧬 Code Graph Analysis (2)
baseapp/abci_test.go (3)
baseapp/baseapp_test.go (1)
  • NewBaseAppSuite (66-96)
baseapp/options.go (1)
  • SetChainID (111-113)
store/rootmulti/store.go (1)
  • Store (60-83)
baseapp/baseapp.go (1)
runtime/events.go (1)
  • NewEventManager (30-33)
🪛 GitHub Actions: Tests / Code Coverage
baseapp/abci_test.go

[error] 202-202: TestABCI_InitChain_WithInitialHeight failed due to panic: runtime error: invalid memory address or nil pointer dereference.

⏰ Context from checks skipped due to timeout of 90000ms (2)
  • GitHub Check: Analyze
  • GitHub Check: Summary
🔇 Additional comments (44)
CHANGELOG.md (1)

73-73: Addition of mutex locks and atomic operations to prevent race conditions.

This change improves thread safety in the baseapp module by adding mutex locks for state and making lastCommitInfo atomic to prevent race conditions between Commit and CreateQueryContext. This is a valuable improvement to the concurrency safety of the application.

baseapp/test_helpers.go (1)

63-64: Good refactoring to use synchronized state access methods.

Replacing direct state field access with the getState() method call enhances thread safety. This change properly aligns with the mutex-based state management introduced in the PR and ensures that all state access is properly synchronized.

Also applies to: 67-67

baseapp/abci_test.go (5)

2528-2574: Excellent addition of race condition test.

This test thoroughly verifies the PR's goal of addressing race conditions between Commit and query context creation. By running 100 concurrent goroutines that create query contexts while simultaneously finalizing and committing 1000 blocks, it effectively tests the synchronization mechanisms added to prevent race conditions.

The use of atomic.Uint64() to safely track the counter is also a good practice for concurrent testing.


2542-2542: Proper use of atomic counter for thread-safe metric tracking.

Using atomic operations for the counter ensures accurate tracking in a concurrent environment. This is the correct approach for shared counters in Go when accessed from multiple goroutines.


2544-2557: Well-structured concurrent test design.

The test properly uses context cancellation for graceful goroutine termination, which is a best practice for managing the lifecycle of concurrent operations. The goroutines are designed to continuously create query contexts until canceled, effectively stressing the synchronization mechanisms.


2559-2569: Good stress test for concurrent block processing.

This section effectively tests the core functionality by repeatedly finalizing and committing blocks while query contexts are being created concurrently. This is exactly what's needed to verify that the race condition has been addressed.


2571-2574: Proper test cleanup and verification.

Canceling the context to stop all goroutines and verifying the final block height is a clean way to conclude the test. This ensures all operations completed successfully and the application state is as expected.

baseapp/baseapp.go (8)

122-125: Good addition of documentation for concurrency safety.
These comments clarify that the states should be accessed via getter and setter to prevent race conditions, which is an excellent practice.


130-130: Introduction of an RWMutex to guard state updates.
Using sync.RWMutex ensures concurrent reads can happen without blocking on one another, while writes remain exclusive.


513-515: Locking in setState to avoid race conditions.
Acquiring a write lock here is crucial to ensure thread-safe updates to the state references within BaseApp.


536-556: New clearState method with appropriate locking.
This method mirrors setState's concurrency approach, preventing race conditions while resetting state references.


679-681: Read lock for getState.
Using an RLock for state retrieval is efficient while still preventing concurrent writes.


760-761: Using getState(execModeFinalize) for concurrency-safe retrieval.
This change is consistent with the new locked state access pattern introduced in BaseApp.


787-787: Refactoring to use getState(execModeFinalize) in beginBlocker.
Properly locks and retrieves the finalizeBlockState instead of direct struct access, preventing race conditions.


849-849: Refactoring to use getState(execModeFinalize) in endBlocker.
Similar to beginBlocker, this ensures consistent, concurrency-safe state access.

baseapp/abci.go (21)

67-67: Use of getState(execModeFinalize) for InitChain logic.
This unifies the approach to retrieving finalizeBlockState, improving thread safety.


85-86: Check state retrieval with concurrency-safe call.
Locking consistency is maintained when updating the checkState with the new header.


105-105: Explicitly setting infinite gas meter on the finalizeState.
Good approach to ensure no unintended gas constraints during initialization.


107-107: Passing the locked finalizeState.Context() to initChainer.
Ensures initialization is performed on the correct, thread-safe state context.


417-418: Accessing prepareProposalState via getState and setting context.
This mirrors the finalize/check states approach, preserving concurrency control.


430-432: Applying consensus params and block gas meter inside a locked context.
Proactively prevents race conditions during proposal preparation.


447-447: Calling prepareProposal with concurrency-safe context.
This pattern keeps consistent locking in place for proposal processes.


505-506: processProposalState retrieved through getState.
Using the locked approach helps avoid accidental concurrent writes to the state.


520-522: Consensus params and block gas meter set inside processProposalState.
Maintains uniform usage of concurrency-safe context usage.


537-537: Verifying proposals with a safe, locked context.
Ensures no shared mutable state is accessed improperly during ProcessProposal.


652-652: Using a cached context from getState(execModeFinalize) in ExtendVote.
Reinforces concurrency-safety when retrieving the finalize block context.


739-739: Fetching finalizeState in internalFinalizeBlock.
Double-checking for nil helps handle replay scenarios more gracefully.


741-742: Re-initializing finalize state if needed.
Ensures partial block replays or abnormal conditions won't skip setting a correct finalize state.


745-745: Updating context with block header info inside a locked finalize state.
Again, consistent concurrency approach for block finalization logic.


766-766: Re-applying block gas meter.
Ensures the block gas meter is updated after updating block and consensus parameters.


769-770: Updating checkState to share the same block gas meter.
Mirrors the concurrency logic for finalizeState, preventing out-of-sync meter usage.


838-838: Setting tracing context to nil in finalizeState's CacheMultiStore.
This avoids leftover trace context from polluting subsequent operations.


841-841: Using concurrency-safe finalizeState in endBlock.
Safer than direct references to potentially stale or partially updated state.


849-849: Caching context from finalizeState in VerifyVoteExtension.
Follows the same concurrency pattern, ensuring consistent state usage.


770-771: Loading commit info for the query if the height matches.
This properly checks the in-memory lastCommitInfo before hitting disk, aligning with the concurrency fix.


904-904: Clearing finalizeState in case of an aborted optimistic execution.
Resetting execModeFinalize is crucial to avoid stale references on the next attempt.

store/rootmulti/store.go (8)

13-13: New import of sync/atomic.
This atomic package usage helps solve concurrency anomalies around lastCommitInfo.


63-63: lastCommitInfo changed to atomic.Pointer[types.CommitInfo].
Storing commit info atomically avoids race conditions when multiple goroutines read/write commit metadata concurrently.


300-300: Storing the new commit info atomically.
Invoking rs.lastCommitInfo.Store(cInfo) ensures the new commit info is quickly visible in a thread-safe manner.


449-450: Loading commit info via atomic pointer.
This read approach is crucial for concurrency. The nil check ensures backward compatibility if no commit info was set.


458-458: Checking for an empty hash from atomic-loaded commit info.
Nicely handles corner-case scenarios without crashing if the stored commit info lacks a hash.


467-467: Returning the commit ID from the loaded commit info.
A simple, thread-safe read that relies on the updated atomic pointer changes.


473-473: Safely reading cInfo for version checks.
Prevents data races when evaluating the store’s current version and initial version.


493-493: Atomic store of commit info during commit.
Ensures other goroutines see the new commit state consistently and immediately.

✨ Finishing Touches
  • 📝 Generate Docstrings

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share
🪧 Tips

Chat

There are 3 ways to chat with CodeRabbit:

‼️ IMPORTANT
Auto-reply has been disabled for this repository in the CodeRabbit settings. The CodeRabbit bot will not respond to your replies unless it is explicitly tagged.

  • Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query. Examples:
    • @coderabbitai generate unit testing code for this file.
    • @coderabbitai modularize this function.
  • PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
    • @coderabbitai gather interesting stats about this repository and render them as a table. Additionally, render a pie chart showing the language distribution in the codebase.
    • @coderabbitai read src/utils.ts and generate unit testing code.
    • @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.
    • @coderabbitai help me debug CodeRabbit configuration file.

Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments.

CodeRabbit Commands (Invoked using PR comments)

  • @coderabbitai pause to pause the reviews on a PR.
  • @coderabbitai resume to resume the paused reviews.
  • @coderabbitai review to trigger an incremental review. This is useful when automatic reviews are disabled for the repository.
  • @coderabbitai full review to do a full review from scratch and review all the files again.
  • @coderabbitai summary to regenerate the summary of the PR.
  • @coderabbitai generate docstrings to generate docstrings for this PR.
  • @coderabbitai resolve resolve all the CodeRabbit review comments.
  • @coderabbitai plan to trigger planning for file edits and PR creation.
  • @coderabbitai configuration to show the current CodeRabbit configuration for the repository.
  • @coderabbitai help to get help.

Other keywords and placeholders

  • Add @coderabbitai ignore anywhere in the PR description to prevent this PR from being reviewed.
  • Add @coderabbitai summary to generate the high-level summary at a specific location in the PR description.
  • Add @coderabbitai anywhere in the PR title to generate the title automatically.

CodeRabbit Configuration File (.coderabbit.yaml)

  • You can programmatically configure CodeRabbit by adding a .coderabbit.yaml file to the root of your repository.
  • Please see the configuration documentation for more information.
  • If your editor has YAML language server enabled, you can add the path at the top of this file to enable auto-completion and validation: # yaml-language-server: $schema=https://coderabbit.ai/integrations/schema.v2.json

Documentation and Community

  • Visit our Documentation for detailed information on how to use CodeRabbit.
  • Join our Discord Community to get help, request features, and share feedback.
  • Follow us on X/Twitter for updates and announcements.

@aljo242
Copy link
Copy Markdown
Contributor

aljo242 commented Apr 14, 2025

@aljo242 aljo242 added the v54 label Apr 14, 2025
@aljo242 aljo242 requested a review from Copilot April 14, 2025 20:39
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copilot reviewed 7 out of 8 changed files in this pull request and generated no comments.

Files not reviewed (1)
  • go.mod: Language not supported
Comments suppressed due to low confidence (3)

store/rootmulti/store.go:300

  • Ensure that all accesses and updates to lastCommitInfo consistently use the atomic pointer operations to prevent any remaining race conditions.
rs.lastCommitInfo.Store(cInfo)

baseapp/baseapp.go:514

  • The introduction of mutex locks for state modifications is a good step; verify that all state accesses (including read paths) are guarded appropriately to maintain consistency under concurrent loads.
app.stateMut.Lock()

baseapp/abci_test.go:2528

  • [nitpick] The new race condition test is a positive addition; consider adding further tests for edge cases involving rapid state changes to fully cover concurrent access scenarios.
func TestABCI_Race_Commit_Query(t *testing.T) {

@beer-1
Copy link
Copy Markdown
Contributor Author

beer-1 commented Apr 15, 2025

@beer-1 https://github.com/cosmos/cosmos-sdk/actions/runs/14452369048/job/40527887477?pr=24508#step:6:16 looks like a panic failure in a baseapp test

Hey @aljo242 fixed the problem

@github-actions github-actions Bot removed C:Cosmovisor Issues and PR related to Cosmovisor C:x/nft labels Apr 16, 2025
@aljo242
Copy link
Copy Markdown
Contributor

aljo242 commented Apr 29, 2025

@technicallyty assigning you this to evaluate if there are any performance impacts with this and if this is breaking in any unexpected ways

@aljo242 aljo242 requested a review from technicallyty April 29, 2025 13:47
@technicallyty
Copy link
Copy Markdown
Member

fixed in #24655

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants