
refactor(mempool): revert to separate DAGs#1820

Open
Mirko-von-Leipzig wants to merge 40 commits into `next` from `mirko/mempool-v4`

Conversation

Collaborator

@Mirko-von-Leipzig Mirko-von-Leipzig commented Mar 23, 2026

This PR effectively rewrites the mempool implementation (again). The result is a combination of the first implementation and the current implementation, hopefully capturing the best of both.

Background

The former was simple -- separate graphs for txs and batches, with nodes moving from one to the next as they are selected. It also had a singular inflight state which was checked when txs were added, but never again after that. This meant it had complex internal transitions which were never checked as nodes moved between graphs. However, within a single graph things were simple, and the graph impl could be shared.

The current implementation has a single graph, with transaction nodes being folded into batch nodes, which fold into block nodes. This graph impl is much more complex, in particular the fold and unfold functions are difficult to grok and test, in part due to pass-through nodes causing DAG properties to no longer hold. The positive is that state cannot really be corrupted since there aren't multiple graphs to make mistakes between.

Why

The driving motivation was to do something about fold and unfold. I couldn't convince myself that the implementations were ever 100% correct, nor that they would survive changes over time.

Eventually I realised that folding and unfolding could be achieved by first reverting all the node subgraphs, and then re-inserting with the new replacement nodes. Reverting and inserting are simple (relatively speaking), and we can lean on the existing insertion and reverting checks as additional assertions during this process.

I then realised there wasn't much need for a single graph anymore, since the original motivation was to support folding and unfolding. However we can use the above to add additional safety to the underlying graph by enforcing that nodes can only ever be reverted if they're leaves, and pruned if they're roots aka either no parents or no children remain in the graph.
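Under these constraints, folding becomes mechanical: revert the nodes to be replaced, then insert the replacement. A minimal sketch (all names hypothetical, not the actual miden-node types) of a DAG that only permits leaf reverts and root prunes:

```rust
use std::collections::{HashMap, HashSet};

// Hypothetical sketch of the invariants described above: nodes may only be
// reverted while they are leaves and pruned while they are roots.
#[derive(Default)]
struct Dag {
    parents: HashMap<u32, HashSet<u32>>,
    children: HashMap<u32, HashSet<u32>>,
}

impl Dag {
    fn insert(&mut self, node: u32, node_parents: &[u32]) {
        self.parents.insert(node, node_parents.iter().copied().collect());
        self.children.entry(node).or_default();
        for &p in node_parents {
            self.children.entry(p).or_default().insert(node);
        }
    }

    /// Revert is only legal on a leaf (no children remain in the graph).
    fn revert(&mut self, node: u32) -> Result<(), &'static str> {
        if !self.children.get(&node).is_some_and(|c| c.is_empty()) {
            return Err("revert: node is not a leaf");
        }
        self.children.remove(&node);
        for p in self.parents.remove(&node).unwrap_or_default() {
            if let Some(c) = self.children.get_mut(&p) {
                c.remove(&node);
            }
        }
        Ok(())
    }

    /// Prune is only legal on a root (no parents remain in the graph).
    fn prune(&mut self, node: u32) -> Result<(), &'static str> {
        if !self.parents.get(&node).is_some_and(|p| p.is_empty()) {
            return Err("prune: node is not a root");
        }
        self.parents.remove(&node);
        for c in self.children.remove(&node).unwrap_or_default() {
            if let Some(p) = self.parents.get_mut(&c) {
                p.remove(&node);
            }
        }
        Ok(())
    }
}
```

Folding tx leaves into a batch would then be a revert of each tx followed by an insert of the batch node, with the existing revert/insert checks acting as assertions along the way.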

This PR

A single graph type is created, which holds its state, and edges based on that state. Nodes can only be appended, reverted (removed from the top/leaves), or pruned (removed from the bottom/roots). Nodes can be marked as selected (to mark a tx/batch as included in batch/block).

Transaction and batch graphs wrap this underlying graph type. This is very similar to the original implementation, with some improvements and additional state checks due to the lessons learnt from the current implementation.

The main Mempool functions should be much easier to read (imo). This implementation is a pure refactor -- user batches and tx-reverting improvement strategies will come in follow-up PRs so they can be reviewed without this huge diff.

How to review

Almost none of the current graph code carries over. The new graphs also live in separate files (which is a positive imo).

I suggest:

  • Review mempool first, since this business-level logic shouldn't change
    • It should be correct wrt the claims made in the batch and transaction top-level graphs.
  • Review the state and edge files, then merge that logically with the assumptions made in the graph impl.

Closes #1439, closes #1253, closes #537

@Mirko-von-Leipzig Mirko-von-Leipzig requested review from bobbinth, igamigo and sergerad and removed request for igamigo March 23, 2026 15:38
@Mirko-von-Leipzig Mirko-von-Leipzig added the no changelog This PR does not require an entry in the `CHANGELOG.md` file label Mar 23, 2026
@Mirko-von-Leipzig Mirko-von-Leipzig marked this pull request as ready for review March 23, 2026 15:39
```rust
//
// This is done to prevent a system bug from causing repeated failures if we keep retrying
// the same transactions. Since we can't trivially identify the cause of the block
// failure, we take the safe route and nuke all associated state.
```
Collaborator

@sergerad sergerad Mar 24, 2026
Just wondering if this is actually a trade-off we want to make. There could be situations where a transient error gets us here and then users need to resubmit transactions, right? That would be confusing / unexpected UX.

I also don't follow why rollback_batch() requeues transactions while this doesn't. Why does that make sense and what is the intention?

Collaborator Author

It's an arbitrary decision, and we could do either. At the time we decided that block failure was much worse and should be avoided at all costs, whereas batches would be more likely to shuffle the same transactions in different orders.

We do have #594 to address this, and I'm opening a PR hopefully today that basically gives each transaction some number of allowed failures before it is nuked.

Collaborator Author

See #1832 which reverts transactions only after they've received four infractions.

```rust
while let Some((id, tx)) = self.inner.selection_candidates().pop_first() {
    if budget.check_then_subtract(tx) == BudgetStatus::Exceeded {
        break;
    }
```
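For context, a plausible shape for the budget type used in the snippet above (the real `BudgetStatus`/`check_then_subtract` in miden-node may differ; the cost unit here is an assumption for illustration):

```rust
// Hypothetical sketch of a selection budget; field names and units are
// assumptions, not the actual miden-node types.
#[derive(PartialEq, Debug)]
enum BudgetStatus {
    Ok,
    Exceeded,
}

struct Budget {
    remaining: usize,
}

impl Budget {
    /// Subtract the candidate's cost if it fits; otherwise report Exceeded
    /// and leave the budget untouched so the selection loop can stop cleanly.
    fn check_then_subtract(&mut self, tx_cost: usize) -> BudgetStatus {
        if tx_cost > self.remaining {
            BudgetStatus::Exceeded
        } else {
            self.remaining -= tx_cost;
            BudgetStatus::Ok
        }
    }
}
```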
Collaborator

Would it be possible to order tx candidates by size so that a single fat tx doesn't block a series of smaller ones unnecessarily? Unsure if this is a real problem.

Collaborator Author

This is a concern; it will be addressed once fees are more realistic. In that case we would prioritise txs based on fees vs effort (or some other strategy).

#1242 is also related to this.
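As a sketch of what such a strategy could look like (the `Candidate` type and its fee/effort fields are hypothetical, not miden-node's actual types), candidates could be ordered by fee-per-effort, using cross-multiplication to avoid floating-point division:

```rust
// Hypothetical selection candidate; fields are assumptions for illustration.
struct Candidate {
    id: u32,
    fee: u64,
    effort: u64,
}

/// Return candidate ids ordered so the best fee-per-effort ratio comes first.
/// Comparing fee_b * effort_a against fee_a * effort_b sorts by the ratio
/// fee/effort in descending order without dividing.
fn select_order(mut candidates: Vec<Candidate>) -> Vec<u32> {
    candidates.sort_by(|a, b| (b.fee * a.effort).cmp(&(a.fee * b.effort)));
    candidates.into_iter().map(|c| c.id).collect()
}
```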

Collaborator Author

Consider `pop_first` a placeholder for a more complex selection strategy.

```rust
.iter()
.flat_map(|block| block.iter())
.map(|batch| batch.transactions().as_slice().len())
.sum::<usize>();
```
Collaborator

I don't think it's a problem, since O(blocks × batches) is fine, but just noting that this inject_telemetry function appears to be called frequently.

Collaborator Author

I'm trying to avoid as much additional bookkeeping as possible, to ensure we don't desync somewhere. This could be replaced by a counter which is incremented and decremented appropriately, though.

Might be worth doing at some point.
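A sketch of that counter alternative (illustrative names only, not miden-node's actual state types):

```rust
// Hypothetical counter replacing the recomputed sum: incremented when a
// batch's transactions enter the inflight set, decremented when they leave.
#[derive(Default)]
struct InflightTxCounter {
    count: usize,
}

impl InflightTxCounter {
    fn on_batch_added(&mut self, tx_count: usize) {
        self.count += tx_count;
    }

    fn on_batch_removed(&mut self, tx_count: usize) {
        // Saturating so a desync shows up as a stuck-at-zero counter
        // rather than a panic in telemetry code.
        self.count = self.count.saturating_sub(tx_count);
    }
}
```

The trade-off is exactly the desync risk mentioned above: the counter must be updated on every path that adds or removes batches, whereas the recomputed sum is always consistent by construction.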



Development

Successfully merging this pull request may close these issues:

  • Improve mempool graph replacement API
  • Consider erasing data in proposed mempool nodes
  • [block-producer]: improve InflightState error handling
