feat: block producer redesign#502

Merged
Mirko-von-Leipzig merged 35 commits into next-block-producer from mirko-block-producer-redesign
Oct 7, 2024
Conversation

Collaborator

@Mirko-von-Leipzig Mirko-von-Leipzig commented Sep 19, 2024

This is still a WIP.

Notably, this only demonstrates the transaction pool implementation, and even there it skips some parts like handling notes and nullifiers.

It does not include the batch producer or block producer; I think those should be fairly straightforward, but they depend on getting the pool interface right.

We should also discuss how we want to integrate this work once it's complete. The work itself is fairly non-trivial, so it may be best to keep the existing design in place until this new design has earned sufficient confidence. In other words, I'm not going to replace any existing code until we are very ready - probably split across several PRs?

This work is also missing revoking/dropping transactions e.g. due to recency conditions. This is possible to add, but since we don't actually have this feature yet I've left it out.

Design overview & intent

The pool tracks inflight transactions, batches and blocks. The definition of inflight is extended slightly to include all data that we consider not stale. This includes some number of blocks which we know to already be complete and in the store. This is done to minimize race conditions for new transaction inputs from the store (by providing several blocks' worth of overlap/grace).

Transactions, batches and blocks are all handled similarly. They're kept in a pool and are only removed once they're considered stale. Dependency graphs are bi-directional, i.e. they store both children and parents (except blocks, which are always sequential, so there's no need). Once items are removed due to staleness, all references to them are removed. This means all items/graph edges should always be present in the pool, which we enforce using .expect(..), as a violation would indicate a rather serious bug.
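As a concrete illustration, the bi-directional bookkeeping might look like the following minimal sketch. The `Graph`/`NodeId` names and methods are hypothetical, not the PR's actual API; the point is the paired parent/child maps and the `.expect(..)`-enforced invariant that every referenced node is still in the pool.

```rust
use std::collections::{BTreeMap, BTreeSet};

type NodeId = u64;

#[derive(Default)]
struct Graph {
    // Bi-directional edges: each node knows both its parents and its children.
    parents: BTreeMap<NodeId, BTreeSet<NodeId>>,
    children: BTreeMap<NodeId, BTreeSet<NodeId>>,
}

impl Graph {
    fn insert(&mut self, id: NodeId, parent_ids: BTreeSet<NodeId>) {
        for &p in &parent_ids {
            // Every referenced parent must already be in the pool;
            // a miss here would indicate a serious bug.
            self.children.get_mut(&p).expect("parent must be in pool").insert(id);
        }
        self.parents.insert(id, parent_ids);
        self.children.insert(id, BTreeSet::new());
    }

    /// Remove a stale node, unlinking every edge that references it.
    fn remove_stale(&mut self, id: NodeId) {
        let parents = self.parents.remove(&id).expect("node must be in pool");
        for p in parents {
            self.children.get_mut(&p).expect("parent must be in pool").remove(&id);
        }
        let children = self.children.remove(&id).expect("node must be in pool");
        for c in children {
            self.parents.get_mut(&c).expect("child must be in pool").remove(&id);
        }
    }
}
```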

Transaction and batch selection works by tracking what I called "roots". A node is considered a root if all of its ancestors have already been processed e.g. a transaction root's parents are all already part of some inflight batch, and a batch root's parents are all already part of some inflight block.
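For illustration, root selection over such a parent map could be sketched like this (names are hypothetical; `processed` stands for "already part of some inflight batch/block"):

```rust
use std::collections::{BTreeMap, BTreeSet};

type NodeId = u64;

/// A node is a root once all of its parents have been processed
/// and it has not been processed itself.
fn roots(
    parents: &BTreeMap<NodeId, BTreeSet<NodeId>>,
    processed: &BTreeSet<NodeId>,
) -> BTreeSet<NodeId> {
    parents
        .iter()
        .filter(|(id, ps)| !processed.contains(id) && ps.iter().all(|p| processed.contains(p)))
        .map(|(id, _)| *id)
        .collect()
}
```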

Outstanding work

I anticipate the need for a lot of tests unless we can find some way of abstracting parts of this. Though I think graph bookkeeping is just painful in general, so we'll need to be thorough.

The rest of the state needs to be added, and the block and batch producers need at least a basic implementation so we can modify the pool API accordingly (missing outputs).

We need to pay particular attention to code that modifies the graphs, and should consider more expensive assertions or runtime audits until we have performance concerns when running live.

Improve naming and comments which I'll get to.

We need to decide what to do in case of failure. Right now we don't really have any options except abort all transactions forming part of the failure?

Given that this is quite a lot of fairly intertwined code, I'm thinking maybe we should split this PR up more once tests are written? e.g. PR 1 includes just add_transaction and its tests and then we build from there.

@Mirko-von-Leipzig Mirko-von-Leipzig added the no changelog This PR does not require an entry in the `CHANGELOG.md` file label Sep 19, 2024
Contributor

@bobbinth bobbinth left a comment


Looks good! Thank you! I didn't review all the logic, but I did go through most of the high-level structure and left some small comments inline.

The main comment is that I'd probably prefer to have dedicated structs which encapsulate specific pieces of functionality, and then use these structs in the TransactionPool.

Comment on lines +17 to +18
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
pub struct BatchId(u64);
Contributor


For batch ID we can use a hash of all transaction IDs that go into the batch (using some stable order). This way, two batches with the same set of transactions would end up having the same ID.

Collaborator Author


This would be ideal, but I have concerns over how expensive that hash would be. It has to be computed inline, meaning it extends the transaction pool lock.

Contributor


I think this depends on which hash function we use:

  • If we go with something like BLAKE3, computing a hash of 16 transaction IDs should take less than 1 microsecond.
  • If we use our arithmetization-friendly hash (RPO) it would be around 50 microseconds (or around 25 microseconds if/when we migrate to RPX hash function).

We could probably start with BLAKE3 and switch to RPO/RPX only if needed, but I think either way it may be OK as my expectation was that batch selection will probably take on the order of 1 millisecond anyways.
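A content-derived batch ID along these lines could be sketched as follows. This is illustrative only: `TransactionId`, `batch_id`, and the use of std's `DefaultHasher` are all stand-ins (a real implementation would use BLAKE3 or RPO, and `DefaultHasher` is not cryptographic). The key property shown is the stable (sorted) order, so the same set of transactions always yields the same ID.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

type TransactionId = [u8; 32];

/// Derive a batch ID from the set of transaction IDs it contains.
fn batch_id(transactions: &[TransactionId]) -> u64 {
    let mut txs = transactions.to_vec();
    // Sorting gives a stable order, making the ID independent of insertion order.
    txs.sort_unstable();
    let mut hasher = DefaultHasher::new();
    for tx in &txs {
        tx.hash(&mut hasher);
    }
    hasher.finish()
}
```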

Collaborator Author


I've left this as is still. I think we may want to separate the concept of batch ID and batch job ID for the situation where we drop a batch and re-create the exact same one. This would result in two "inflight" batches with the same ID.

Maybe this is fine; I'll have to think about it some more. For now, both the counter-based and hash-based IDs are still around.

Contributor


I think we may want to separate the concept of batch ID and batch job ID for the situation where we drop a batch and re-create the exact same one. This would result in two "inflight" batches with the same ID.

As far as I can tell, this shouldn't happen because we can't put the same transaction into two different batches at the same time (or at least we shouldn't). So, we could have two batches with the same ID - but it shouldn't be at the same time.

My preference is still to use a hash-based ID as I think it is conceptually a bit simpler and could help us catch some errors (i.e., if we do detect 2 batches with the same ID in flight, something probably went wrong).

Collaborator Author


An example:

Two inflight batches N and N+1 where N+1 depends on N.

Batch N fails to prove and we inform the mempool of this. The mempool requeues all transactions from N and N+1.

Select two new batches and we get N and N+1 again. We now have three batches inflight, with a duplicate N+1.

Note that the batch producer has no idea of any dependencies so it cannot automatically drop the original N+1.

@Mirko-von-Leipzig Mirko-von-Leipzig marked this pull request as draft September 23, 2024 12:49
@Mirko-von-Leipzig Mirko-von-Leipzig marked this pull request as ready for review September 25, 2024 16:23
Collaborator Author

I've refactored quite a bit, and I've added implementations for the RPC add_transaction and a batch producer with a worker pool.

There is still quite a bit missing. Notably tests, checking all inputs, and the block producer implementation.

@Mirko-von-Leipzig Mirko-von-Leipzig force-pushed the mirko-block-producer-redesign branch from 620463e to 05e4e75 Compare September 26, 2024 07:08
Contributor

@bobbinth bobbinth left a comment


Looks good! Thank you! This is not a full review, but I did leave some comments inline - mostly minor though. I think the biggest remaining thing is to incorporate note/nullifier tracking as this may affect the structure a bit.

@Mirko-von-Leipzig Mirko-von-Leipzig changed the base branch from next to next-block-producer October 4, 2024 06:56
@Mirko-von-Leipzig Mirko-von-Leipzig marked this pull request as draft October 4, 2024 06:56
@Mirko-von-Leipzig Mirko-von-Leipzig force-pushed the mirko-block-producer-redesign branch from 05e4e75 to 45627f3 Compare October 4, 2024 14:35
Collaborator Author


This is just an example; more scrutiny should be applied when we fill in the details.

Collaborator Author


Just an example.

Comment on lines +28 to +41
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
pub struct BatchJobId(u64);

impl Display for BatchJobId {
    fn fmt(&self, f: &mut std::fmt::Formatter<'_>) -> std::fmt::Result {
        self.0.fmt(f)
    }
}

impl BatchJobId {
    pub fn increment(&mut self) {
        self.0 += 1;
    }
}
Collaborator Author


I'm still unsure about using the hash as ID (this comment). We can punt that discussion to when we implement the batch building though.

Comment on lines +194 to +202
/// The latest committed state.
///
/// Only valid if the committed count is greater than zero.
committed_state: Digest,

/// The number of committed transactions.
///
/// If this is zero then the committed state is meaningless.
committed_count: usize,
Collaborator Author


Alternatively, we can just move this back into a VecDeque.
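For comparison, the VecDeque alternative might look roughly like the following sketch. All names (`AccountStates`, `commit`, the `Digest`/`TxId` aliases) are hypothetical, not the PR's API; the idea is that each inflight transaction carries its resulting state, so the committed state and count are implicit rather than stored separately.

```rust
use std::collections::VecDeque;

type Digest = [u8; 32];
type TxId = u64;

#[derive(Default)]
struct AccountStates {
    // Front entries are oldest; committing pops them off the front.
    inflight: VecDeque<(TxId, Digest)>,
}

impl AccountStates {
    fn push(&mut self, tx: TxId, state: Digest) {
        self.inflight.push_back((tx, state));
    }

    /// Commit the `n` oldest transactions, returning the latest committed state.
    ///
    /// Panics if fewer than `n` transactions are inflight.
    fn commit(&mut self, n: usize) -> Option<Digest> {
        self.inflight.drain(..n).last().map(|(_, state)| state)
    }
}
```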

Comment on lines +35 to +36
/// The number of transactions that affected each account.
account_transactions: BTreeMap<AccountId, usize>,
Collaborator Author


We could do additional checks here if we stored the actual state transitions.

/// Transactions are returned in a valid execution ordering.
///
/// Returns `None` if no transactions are available.
pub fn select_batch(&mut self) -> Option<(BatchJobId, Vec<TransactionId>)> {
Collaborator Author


This intentionally returns IDs for now.

/// # Panics
///
/// Panics if there is already a block in flight.
pub fn select_block(&mut self) -> (BlockNumber, BTreeSet<BatchJobId>) {
Collaborator Author


Intentionally returns IDs (for now).

Collaborator Author

Mirko-von-Leipzig commented Oct 4, 2024

As discussed offline, this work will be done on feature branch next-block-producer until we can finalize the entire body of work.

I'll summarize what this PR contains, and what it purposefully punts to future PRs. Some of this future work results in less-than-stellar current naming/structuring, as it's going to change soon. I'll attempt to point these sore spots out in a self-review so we don't focus on them too hard.

Notable changes from last review

  • State tracking moved to its own module/types.
  • Graph data is removed as soon as it is committed instead of waiting for staleness.
  • Added "state diffs" as a way to commit/revert batches of transactions, instead of keeping transactions around until they're pruned.
  • Block and batch builder examples.
  • Independent batches in a block.
  • Nullifier support.

Included in this PR

  • dependency tracking at transaction, batch and block levels
  • basic batch selection
  • basic block selection (ensures batches within a block are independent)
  • tracking and reverting inflight state (account state & nullifiers)

Excluded

  • input and output note state tracking
  • data propagation beyond the mempool
  • batching (a simple worker pool is demonstrated on dummy data)
  • block production (a simple worker loop is demonstrated on dummy data)
  • tests
  • replacing existing framework
  • reverting individual transactions (should be simple)

What's next

Adding stronger typing to transactions, notes and transaction inputs will inform the structure of things; and I believe note tracking will fall out of this naturally.

Following that, implementing the types for batches and blocks makes sense I think.

I also want to raise info! and panic! out of the mempool, and instead have those handled in the callers. This won't be trivially possible everywhere but I think it makes more sense that way.

@Mirko-von-Leipzig Mirko-von-Leipzig marked this pull request as ready for review October 4, 2024 15:08
}

#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
pub struct BlockNumber(u32);
Collaborator Author


This probably belongs in base? Something else I've been itching to do is new-type all of the different fields in the block header for example. Having all of them be Digest/u32 isn't great.

Contributor


Yes, I think moving this to miden-base would be a good idea. Would you like to make a PR there?

For the other fields - also agreed, but maybe we do these incrementally over time?

Collaborator Author


Will do. And incrementally is good; no rush :) One can also overdo it.

Contributor

@bobbinth bobbinth left a comment


Looks good! Thank you! Not a very thorough review from me - but I did leave a few comments inline (mostly for the future).

}

async fn build_and_commit_block(&self, batches: BTreeSet<BatchJobId>) -> Result<(), ()> {
    todo!("Aggregate, prove and commit block");
Contributor


This will contain building the block and saving it to the store, right?

For block building, since we don't actually prove anything yet, I am thinking we could put in an artificial delay of 3 - 5 seconds to get the conditions close to the real thing.

Collaborator Author


Correct. I was also considering injecting failures randomly as mentioned here.


fn spawn(&mut self, id: BatchJobId, transactions: Vec<TransactionId>) {
    self.0.spawn(async move {
        todo!("Do actual work like aggregating transaction data");
Contributor


Similar to the above comment: batch building could also take 3 - 5 seconds. Let's artificially simulate this for now.

Collaborator Author


I was wondering if we should also simulate failures. So something like sleep a random period, and then randomly fail a batch every now and then.

If we do want this, the failure rate should probably be a configurable parameter. Similar for the entire block.

Though this might annoy users if it causes a user tx to expire. And it could look bad in terms of throughput; though I don't think that's an issue right now (?).
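A configurable failure injector along these lines could be sketched as follows. All names are hypothetical, and the deterministic LCG is only a stand-in for a real RNG (e.g. the `rand` crate); the point is the single `failure_rate` knob, which could also apply to block production.

```rust
/// Injects artificial failures into the simulated batch prover.
struct FailureInjector {
    /// Probability of failing a batch, e.g. 0.05 => fail ~5% of batches.
    failure_rate: f64,
    rng_state: u64,
}

impl FailureInjector {
    fn new(failure_rate: f64, seed: u64) -> Self {
        Self { failure_rate, rng_state: seed }
    }

    /// Returns true if this batch should be failed artificially.
    fn should_fail(&mut self) -> bool {
        // Minimal LCG step; a real implementation would use a proper RNG.
        self.rng_state = self
            .rng_state
            .wrapping_mul(6364136223846793005)
            .wrapping_add(1442695040888963407);
        // Map the top 53 bits to a uniform sample in [0, 1).
        let sample = (self.rng_state >> 11) as f64 / (1u64 << 53) as f64;
        sample < self.failure_rate
    }
}
```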

Contributor


If this is configurable and not too difficult to add, I don't mind doing it.

@Mirko-von-Leipzig Mirko-von-Leipzig merged commit ed72500 into next-block-producer Oct 7, 2024
@Mirko-von-Leipzig Mirko-von-Leipzig deleted the mirko-block-producer-redesign branch October 7, 2024 11:05
Mirko-von-Leipzig added a commit that referenced this pull request Nov 3, 2024
Initial outline of the block-producer redesign.

Includes a new mempool which can track transaction and batch dependencies as a graph. This
enables more complex mempool strategies.