feat: block producer redesign#502
Conversation
bobbinth
left a comment
Looks good! Thank you! I didn't review all the logic, but I did go through most of the high-level structure and left some small comments inline.
The main comment is that I'd probably prefer to have dedicated structs which encapsulate specific pieces of functionality, and then use these structs in the TransactionPool.
```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
pub struct BatchId(u64);
```
For batch ID we can use a hash of all transaction IDs that go into the batch (using some stable order). This way, two batches with the same set of transactions would end up having the same ID.
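A minimal sketch of this scheme, using the standard library's `DefaultHasher` as a stand-in for BLAKE3 and a placeholder `TransactionId` type (both are assumptions, not the PR's actual types):

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// Placeholder: the real TransactionId is a digest type.
type TransactionId = [u8; 32];

/// Derive a batch ID from the set of transaction IDs it contains.
/// Sorting first gives the "stable order", so two batches with the
/// same transaction set produce the same ID.
fn batch_id(mut tx_ids: Vec<TransactionId>) -> u64 {
    tx_ids.sort();
    let mut hasher = DefaultHasher::new();
    for id in &tx_ids {
        id.hash(&mut hasher);
    }
    hasher.finish()
}
```

With a real cryptographic hash the output would be a full digest rather than a `u64`, but the sort-then-hash shape stays the same.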
This would be ideal, but I have concerns over how expensive that hash would be. It has to be computed inline, which extends the transaction pool lock.
I think this depends on which hash function we use:
- If we go with something like BLAKE3, computing a hash of 16 transaction IDs should take less than 1 microsecond.
- If we use our arithmetization-friendly hash (RPO) it would be around 50 microseconds (or around 25 microseconds if/when we migrate to RPX hash function).
We could probably start with BLAKE3 and switch to RPO/RPX only if needed, but I think either way it may be OK, as my expectation is that batch selection will probably take on the order of 1 millisecond anyway.
I've left this as-is for now. I think we may want to separate the concept of a batch ID from a batch job ID for the situation where we drop a batch and re-create the exact same one; this would result in two "inflight" batches with the same ID.
Maybe this is fine; I'll have to think about it some more. For now, both the counter-based and hash-based variants are still around.
I think we may want to separate the concept of batch ID and batch job ID for the situation where we drop a batch and re-create the exact same one. This would result in two "inflight" batches with the same ID.
As far as I can tell, this shouldn't happen because we can't put the same transaction into two different batches at the same time (or at least we shouldn't). So, we could have two batches with the same ID - but it shouldn't be at the same time.
My preference is still to use a hash-based ID as I think it is conceptually a bit simpler and could help us catch some errors (i.e., if we do detect 2 batches with the same ID in flight, something probably went wrong).
An example:
1. Two inflight batches N and N+1, where N+1 depends on N.
2. Batch N fails to prove and we inform the mempool of this. The mempool requeues all transactions from N and N+1.
3. We select two new batches and get N and N+1 again. We now have three batches inflight, with a duplicate N+1.

Note that the batch producer has no idea of any dependencies, so it cannot automatically drop the original N+1.
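One way to reconcile this thread (purely illustrative, not what the PR settles on) is to keep both notions as separate types: a content-derived `BatchId` that may legitimately repeat across proving attempts, and a counter-based `BatchJobId` that stays unique per attempt. The allocator below is a hypothetical helper:

```rust
/// Content-derived identity: equal transaction sets yield equal IDs,
/// so a re-created batch repeats its predecessor's ID.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct BatchId(u64);

/// Proving-attempt identity: a fresh value per spawned batch job,
/// unique even when the underlying batch content repeats.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct BatchJobId(u64);

/// Hypothetical allocator handing out monotonically increasing job IDs.
#[derive(Default)]
struct JobIdAllocator(u64);

impl JobIdAllocator {
    fn next_id(&mut self) -> BatchJobId {
        let id = BatchJobId(self.0);
        self.0 += 1;
        id
    }
}
```

Under this split, two inflight N+1 batches would share a `BatchId` but carry distinct `BatchJobId`s, so the duplicate is still addressable.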
I've refactored quite a bit, and I've added implementations for the RPC. There is still quite a bit missing, notably tests, checking all inputs, and the block producer implementation.
Force-pushed 620463e to 05e4e75
bobbinth
left a comment
Looks good! Thank you! This is not a full review, but I did leave some comments inline - mostly minor though. I think the biggest remaining thing is to incorporate note/nullifier tracking as this may affect the structure a bit.
In anticipation of creating tx, batch and block pools.
Separate error types from error representation on the wire by splitting the code into an implementation and a wrapper which performs the RPC conversions. This separation gives us a single place to map certain errors to internal errors etc.
Essentially just a clone of the existing block builder with the mempool addition.
This includes extracting account state and a half-baked transaction purging.
In favor of implementing the details in other PRs.
Force-pushed 05e4e75 to 45627f3
This is just an example; more scrutiny should be applied when we fill in the details.
Just an example.
```rust
use std::fmt::Display;

#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
pub struct BatchJobId(u64);

impl Display for BatchJobId {
    fn fmt(&self, f: &mut std::fmt::Formatter<'_>) -> std::fmt::Result {
        self.0.fmt(f)
    }
}

impl BatchJobId {
    /// Advances this ID in place; taking `&mut self` (rather than `mut self`
    /// by value) ensures the caller's value is actually updated.
    pub fn increment(&mut self) {
        self.0 += 1;
    }
}
```
I'm still unsure about using the hash as ID (this comment). We can punt that discussion to when we implement the batch building though.
```rust
/// The latest committed state.
///
/// Only valid if the committed count is greater than zero.
committed_state: Digest,

/// The number of committed transactions.
///
/// If this is zero then the committed state is meaningless.
committed_count: usize,
```
Alternatively, we can just move this back into a VecDeque.
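A rough sketch of that alternative, with a placeholder `Digest` type; the queue's emptiness check replaces the "count greater than zero" validity rule:

```rust
use std::collections::VecDeque;

// Placeholder for the real digest type.
type Digest = [u8; 32];

/// Committed states kept in a queue rather than as a single latest
/// digest plus a count, so there is no "meaningless when zero" state.
struct CommittedStates(VecDeque<Digest>);

impl CommittedStates {
    fn push(&mut self, state: Digest) {
        self.0.push_back(state);
    }

    /// `None` naturally encodes "nothing committed yet".
    fn latest(&self) -> Option<&Digest> {
        self.0.back()
    }

    fn committed_count(&self) -> usize {
        self.0.len()
    }
}
```

The trade-off is the extra allocation and retention of intermediate states, which the digest-plus-count representation avoids.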
```rust
/// The number of transactions that affected each account.
account_transactions: BTreeMap<AccountId, usize>,
```
We could do additional checks here if we stored the actual state transitions.
```rust
/// Transactions are returned in a valid execution ordering.
///
/// Returns `None` if no transactions are available.
pub fn select_batch(&mut self) -> Option<(BatchJobId, Vec<TransactionId>)> {
```
This intentionally returns IDs for now.
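The root-tracking idea behind `select_batch` can be sketched roughly like this; the field names, the `u64` stand-ins for transaction and job IDs, and the batch-size cap are all assumptions, not the PR's actual code:

```rust
use std::collections::BTreeSet;

struct TransactionPool {
    /// Transactions whose parents are all already in some inflight batch.
    roots: BTreeSet<u64>,
    next_batch_job: u64,
}

impl TransactionPool {
    const BATCH_SIZE: usize = 16;

    /// Pop up to BATCH_SIZE roots as the next batch; `None` when no
    /// transaction is currently selectable.
    fn select_batch(&mut self) -> Option<(u64, Vec<u64>)> {
        if self.roots.is_empty() {
            return None;
        }
        let batch: Vec<u64> = self.roots.iter().copied().take(Self::BATCH_SIZE).collect();
        for tx in &batch {
            self.roots.remove(tx);
        }
        let job_id = self.next_batch_job;
        self.next_batch_job += 1;
        Some((job_id, batch))
    }
}
```

A real implementation would also promote child transactions to roots once all of their parents are batched, which is omitted here.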
```rust
/// # Panics
///
/// Panics if there is already a block in flight.
pub fn select_block(&mut self) -> (BlockNumber, BTreeSet<BatchJobId>) {
```
Intentionally returns IDs (for now).
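The single-block-in-flight invariant that the panic documents could look like the following sketch (hypothetical field names, simplified return type):

```rust
struct BlockPool {
    next_block: u32,
    block_in_flight: bool,
}

impl BlockPool {
    /// Panics if there is already a block in flight.
    fn select_block(&mut self) -> u32 {
        assert!(!self.block_in_flight, "a block is already in flight");
        self.block_in_flight = true;
        self.next_block
    }

    /// Called once the inflight block is committed to the store.
    fn block_committed(&mut self) {
        self.block_in_flight = false;
        self.next_block += 1;
    }
}
```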
As discussed offline, this work will be done on a feature branch. I'll summarize what this PR contains, and what it purposefully punts to future PRs. Some of this future work results in less-than-stellar current naming/structuring, as it's going to change soon. I'll attempt to point these sore spots out in a self-review so we don't focus on them too hard.

Notable changes from last review
Included in this PR
Excluded
What's next

Adding stronger typing to transactions, notes and transaction inputs will inform the structure of things, and I believe note tracking will fall out of this naturally. Following that, implementing the types for batches and blocks makes sense, I think. I also want to raise
```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
pub struct BlockNumber(u32);
```
This probably belongs in base? Something else I've been itching to do is to new-type all of the different fields in the block header, for example. Having all of them be Digest/u32 isn't great.
Yes, I think moving this to miden-base would be a good idea. Would you like to make a PR there?
For the other fields - also agreed, but maybe we do these incrementally over time?
Will do. And incrementally is good; no rush :) One can also overdo it.
bobbinth
left a comment
Looks good! Thank you! Not a very thorough review from me - but I did leave a few comments inline (mostly for the future).
```rust
async fn build_and_commit_block(&self, batches: BTreeSet<BatchJobId>) -> Result<(), ()> {
    todo!("Aggregate, prove and commit block");
```
This will contain building the block and saving it to the store, right?
For block building, since we don't actually prove anything yet, I am thinking we could put in an artificial delay of 3 - 5 seconds to get the conditions close to the real thing.
Correct. I was also considering injecting failures randomly as mentioned here.
```rust
fn spawn(&mut self, id: BatchJobId, transactions: Vec<TransactionId>) {
    self.0.spawn(async move {
        todo!("Do actual work like aggregating transaction data");
```
Similar to the above comment: batch building could also take 3 - 5 seconds. Let's artificially simulate this for now.
I was wondering if we should also simulate failures. So something like sleep a random period, and then randomly fail a batch every now and then.
If we do want this, the failure rate should probably be a configurable parameter. Similar for the entire block.
Though this might annoy users if it causes a user tx to expire. And it could look bad in terms of throughput; though I don't think that's an issue right now (?).
If this is configurable and not too difficult to add, I don't mind doing it.
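A simulated prover along these lines might look roughly like the following. The 3-5 second window and the configurable `failure_rate` come from this thread; the tiny linear congruential generator is a stand-in for a real RNG (e.g. the `rand` crate), and all names are hypothetical:

```rust
use std::time::Duration;

/// Stand-in for batch/block proving: sleep a pseudo-random 3-5 seconds,
/// then fail with a configurable probability.
struct FailureSimulator {
    state: u64,
    /// Configurable, e.g. 0.05 to fail roughly 5% of batches.
    failure_rate: f64,
}

impl FailureSimulator {
    fn new(seed: u64, failure_rate: f64) -> Self {
        Self { state: seed, failure_rate }
    }

    /// Minimal linear congruential generator returning a value in [0, 1).
    fn next_unit(&mut self) -> f64 {
        self.state = self
            .state
            .wrapping_mul(6364136223846793005)
            .wrapping_add(1442695040888963407);
        (self.state >> 11) as f64 / (1u64 << 53) as f64
    }

    /// Pick a proving delay uniformly in [3s, 5s).
    fn proving_delay(&mut self) -> Duration {
        Duration::from_millis(3_000 + (self.next_unit() * 2_000.0) as u64)
    }

    /// Decide whether this proof attempt should be simulated as failed.
    fn should_fail(&mut self) -> bool {
        self.next_unit() < self.failure_rate
    }
}
```

The batch worker would `sleep(sim.proving_delay()).await` and then report failure when `sim.should_fail()` returns true; a rate of 0.0 turns the feature off entirely.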
Initial outline of the block-producer redesign. Includes a new mempool which can track transaction and batch dependencies as a graph. This enables more complex mempool strategies.
This is still a WIP.
Notably, this only demonstrates the transaction pool implementation and even there skips some parts like handling notes and nullifiers.
Does not include batch producer nor block producer; as I think they should be rather straight-forward but depend on getting the pool interface correct.
We should also discuss how we want to integrate this work once it's complete. The work itself is fairly non-trivial, so it may be best to keep the existing design in place until this new design has sufficient confidence. In other words, I'm not going to replace any existing code until we are very ready - probably split across several PRs?
This work is also missing revoking/dropping transactions e.g. due to recency conditions. This is possible to add, but since we don't actually have this feature yet I've left it out.
Design overview & intent
The pool tracks inflight transactions, batches and blocks. The definition of inflight is extended slightly to include all data that we consider not yet stale. This includes some number of blocks which we know to already be complete and in the store. This is done to minimize race conditions for new transaction inputs from the store (by providing several blocks of overlap/grace).
Transactions, batches and blocks are all handled similarly. They're kept in a pool and are only removed once they're considered stale. Dependency graphs are bi-directional, i.e. they store both children and parents (except blocks, which are always sequential, so there is no need). Once items are removed due to staleness, all references to them are removed as well. This means all items/graph edges should always be in the pool, which we enforce using `.expect(..)`, as a violation would indicate a rather serious bug.

Transaction and batch selection works by tracking what I call "roots". A node is considered a root if all of its ancestors have already been processed, e.g. a transaction root's parents are all already part of some inflight batch, and a batch root's parents are all already part of some inflight block.
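The root-promotion rule described above could be sketched as a small bi-directional graph; node IDs are bare `u64`s and all names are illustrative, not the PR's actual types:

```rust
use std::collections::{BTreeMap, BTreeSet};

/// Minimal sketch of a bi-directional dependency graph with root tracking.
#[derive(Default)]
struct DependencyGraph {
    parents: BTreeMap<u64, BTreeSet<u64>>,
    children: BTreeMap<u64, BTreeSet<u64>>,
    processed: BTreeSet<u64>,
    roots: BTreeSet<u64>,
}

impl DependencyGraph {
    fn insert(&mut self, node: u64, parents: BTreeSet<u64>) {
        for &p in &parents {
            self.children.entry(p).or_default().insert(node);
        }
        // A node is a root if all of its parents are already processed.
        if parents.iter().all(|p| self.processed.contains(p)) {
            self.roots.insert(node);
        }
        self.parents.insert(node, parents);
    }

    /// Mark a root as processed (e.g. included in a batch), promoting any
    /// children whose parents are now all processed.
    fn process(&mut self, node: u64) {
        assert!(self.roots.remove(&node), "only roots may be processed");
        self.processed.insert(node);
        for child in self.children.get(&node).cloned().unwrap_or_default() {
            if self.parents[&child].iter().all(|p| self.processed.contains(p)) {
                self.roots.insert(child);
            }
        }
    }
}
```

Removal on staleness (dropping a node and all edges referencing it) is omitted here; per the design, that path would `.expect(..)` on every lookup since a dangling edge indicates a serious bug.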
Outstanding work
I anticipate the need for a lot of tests unless we can find some way of abstracting parts of this. Though I think graph bookkeeping is in general just painful, so we need to be thorough.
The rest of the state needs to be added, and the block and batch producers need at least a basic implementation so we can modify the pool api accordingly (missing outputs).
We need to pay particular attention to code that modifies the graphs, and should consider more expensive assertions, or runtime audits until we have performance concerns when running live?
Improve naming and comments which I'll get to.
We need to decide what to do in case of failure. Right now we don't really have any options except abort all transactions forming part of the failure?
Given that this is quite a lot of fairly intertwined code, I'm thinking maybe we should split this PR up more once tests are written? E.g. PR 1 includes just `add_transaction` and its tests, and then we build from there.