Add new multithreaded TwoQubitPeepholeOptimization pass #13419
mtreinish wants to merge 67 commits into Qiskit:main
Conversation
Force-pushed from ad06d1a to 4d160bc
This commit adds a new transpiler pass for physical optimization, TwoQubitPeepholeOptimization. It replaces the use of Collect2qBlocks, ConsolidateBlocks, and UnitarySynthesis in the optimization stage of a default pass manager setup. The pass logically works the same way: it analyzes the dag to get a list of 2q runs, calculates the matrix of each run, and then synthesizes the matrix and substitutes it in place. The distinction is that this pass does all of this in a single pass, and also parallelizes the matrix calculation and synthesis steps because there is no data dependency there.

This new pass is not meant to fully replace the Collect2qBlocks, ConsolidateBlocks, or UnitarySynthesis passes, as those also run in contexts where we don't have a physical circuit. It is meant to replace their usage in the optimization stage only. Accordingly, this new pass also changes the logic for how we select the synthesis to use and when to make a substitution. Previously this logic was primarily done via the ConsolidateBlocks pass, which only consolidated to a UnitaryGate if the number of basis gates needed based on the Weyl chamber coordinates was less than the number of 2q gates in the block (see Qiskit#11659 for discussion on this). Since this new pass skips the explicit consolidation stage, we instead try all the available synthesizers.

Right now this commit has a number of limitations, the largest being:

- It only supports the target.
- It doesn't support any synthesizers besides the TwoQubitBasisDecomposer, because that's the only one in rust currently.

For plugin handling I left the logic as running the three-pass series, but I'm not sure this is the behavior we want. We could keep the synthesis plugins for `UnitarySynthesis` only, and then rely on our built-in methods for physical optimization. But this also seems less than ideal, because the plugin mechanism is how we support synthesizing to custom basis gates as well as more advanced approximate synthesis methods, both of which are things we need to do as part of the synthesis here.

Additionally, this is currently missing tests and documentation, and while running it manually "works", in that it returns a circuit that looks valid, I've not done any validation yet. It also will likely need several rounds of performance optimization and tuning. At this point this is just a rough proof of concept and will need a lot of refinement, along with larger changes to Qiskit's rust code, before it is ready to merge.

Fixes Qiskit#12007
Fixes Qiskit#11659
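The "no data dependency" claim above is the key to the parallelism. A toy sketch (not Qiskit's actual code, and using plain nested lists instead of real unitaries) of the structure: each collected 2q run folds into one matrix independently, so the matrix calculation for every block can be farmed out to a pool at once.

```python
# Toy model of the pass's parallel structure: each 2q run is independent,
# so "compute the block matrix" can run for all blocks concurrently.
from concurrent.futures import ThreadPoolExecutor

def matmul(a, b):
    """4x4 matrix product on nested lists (stand-in for the unitary math)."""
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]

def block_matrix(gates):
    """Fold a 2q run into a single matrix (later gates applied on the left)."""
    mat = [[1 if i == j else 0 for j in range(4)] for i in range(4)]  # identity
    for g in gates:
        mat = matmul(g, mat)
    return mat

CX = [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 0, 1], [0, 0, 1, 0]]
runs = [[CX, CX], [CX, CX, CX]]       # two independent 2q runs
with ThreadPoolExecutor() as pool:    # no data dependency between runs
    mats = list(pool.map(block_matrix, runs))
# cx . cx cancels, so the first run's matrix is the identity and the
# block can be synthesized away entirely.
```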
Since Qiskit#13139 merged we have another two qubit decomposer available to run in rust, the TwoQubitControlledUDecomposer. This commit updates the new TwoQubitPeepholeOptimization to call this decomposer when the target supports the appropriate 2q gates.
Clippy is correctly warning that the size difference between the two variants of the TwoQubitDecomposer enum is large: TwoQubitBasisDecomposer is 1640 bytes while TwoQubitControlledUDecomposer is only 24 bytes, meaning each ControlledU element wastes > 1600 bytes. However, in this case that is acceptable in order to avoid a layer of pointer indirection, as these are stored temporarily in a Vec inside a thread to decompose a unitary. A trait would be a more natural way to define a common interface between all the two qubit decomposers, but since we keep them instantiated for each edge in a Vec they need to be Sized, and working around that with something like `Box<dyn TwoQubitDecomposer>` (assuming a trait `TwoQubitDecomposer` instead of an enum) would add runtime overhead. This is also weighed against the fact that TwoQubitControlledUDecomposer is far less likely to be used in practice, as it only works with targets that have RZZ, RXX, RYY, or RZX gates on an edge, which is less common.
Also don't run scoring more than needed.
Copying here the comment from @t-imamichi in #13568 (comment):
We should make sure that after PR #13568 and this PR are merged, we can efficiently transpile circuits into a basis with fractional RZZ gates.
I added support for using the
Prior to this commit there was a bug in the dag reconstruction logic for specific subcircuit formations that could cause the synthesized block to be inserted before its parent nodes in the dag. Specifically, consider a circuit such as:

0: h(1)
1: cx(3, 2)
2: cx(1, 2)
3: cx(2, 1)

(not an exact example, but it showcases the problem well). Nodes 0, 2, and 3 form a block that can be optimized through peephole resynthesis. The previous logic for reconstruction could encounter node 0 first when iterating over the nodes in topological order, and would then proceed to insert the entire contents of the synthesized block before node 1. This is invalid because node 1 is a parent of nodes 2 and 3. This commit fixes this by holding the block insertion until the first 2q node instead of the first node.
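The fix can be sketched with a toy model (not Qiskit's actual reconstruction code): nodes are `(name, num_qubits)` pairs in topological order, and the synthesized block is inserted at the block's first 2q member rather than its first member, so a non-block parent like node 1 cannot be jumped over.

```python
# Toy model of the deferred-insertion fix: hold the synthesized block
# until the first *2q* node of the block, not the first node.
def rebuild(topo_order, block, synthesized):
    out = []
    inserted = False
    for name, nq in topo_order:
        if name in block:
            # A leading 1q block member (h0 below) must not trigger the
            # insertion: a non-block 2q parent (cx1) may still come after it.
            if nq == 2 and not inserted:
                out.append(synthesized)
                inserted = True
            continue  # original block members are dropped
        out.append((name, nq))
    return out

# The example from the commit message: block is nodes 0, 2, and 3.
topo = [("h0", 1), ("cx1", 2), ("cx2", 2), ("cx3", 2)]
result = rebuild(topo, {"h0", "cx2", "cx3"}, ("synth", 2))
```

With the old behavior (insert at the first block member of any arity), the synthesized node would have landed before `cx1`, its parent; here it lands after it.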
The score needs to be lower to count as an improvement. Before, we were comparing fidelities as the second field, but lower fidelity is worse and we shouldn't prefer that. This switches to using error as the second field in the score comparison, so that lower is consistently better.
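A minimal illustration of why the field had to flip (the score layout here is hypothetical, not Qiskit's actual score type): with lexicographic min-comparison, every field must follow the same "lower is better" convention, so fidelity has to enter as error.

```python
# Two candidates tied on the first field; only the second field differs.
candidates = {
    "a": {"gates_2q": 2, "fidelity": 0.999},
    "b": {"gates_2q": 2, "fidelity": 0.990},
}

# Buggy: fidelity as the second field makes min() prefer *lower* fidelity.
buggy_best = min(candidates, key=lambda k: (candidates[k]["gates_2q"],
                                            candidates[k]["fidelity"]))

# Fixed: compare error = 1 - fidelity, so min() prefers higher fidelity.
fixed_best = min(candidates, key=lambda k: (candidates[k]["gates_2q"],
                                            1.0 - candidates[k]["fidelity"]))
```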
No, it's much closer than before but still not ready, so it'll have to be for 2.5. It seems to be fully functional now, but I still have benchmarking and performance tuning to do. I took some shortcuts that aren't optimal to get a working baseline as a starting point, and I need to fix those. I also want to do some small code cleanup after all that.
The heuristic used for when to optimize a block has 3 things it looks at:
This is used both to pick the best synthesis from all the available decomposers and to decide whether to replace the block with that best synthesis. If all 3 are a tie then it defaults to not replacing the block.
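The tie-breaking rule can be sketched as follows. The three score fields used here (2q gate count, total gate count, estimated error) are an assumption for illustration, since the comment above doesn't enumerate them; the point is only the comparison shape: lower is better for every field, and an exact tie keeps the original block.

```python
# Hypothetical 3-field score; Python compares tuples lexicographically.
def should_replace(original_score, synthesized_score):
    """Replace only if the synthesized block strictly improves the score."""
    return synthesized_score < original_score

# Fewer 2q gates wins even with equal error in the last field.
replace = should_replace((3, 5, 0.02), (2, 4, 0.02))
# An exact tie on all three fields defaults to keeping the original block.
keep = not should_replace((2, 4, 0.01), (2, 4, 0.01))
```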
This commit moves away from using a unique UnitarySynthesisState instance for each block run in the parallel iterator, to using a thread-local cache. This means we'll have n caches that are reused per thread. The downside is potentially re-computing the same set of decomposers between threads. The other tradeoff is that this requires a RefCell for dynamic borrow checking, as thread-local storage only returns `&` references and not mutable ones. The dynamic borrow checking has a runtime overhead on each cache access. In practice the overhead of this approach should be minimal, especially compared to some of the other options, and this approach speeds the pass up by about 7%.

The cache struct used in the state object is not thread safe, as at its core it's based around an IndexMap mapping the qubits to the decomposers to use. While we could look at refactoring the struct to be internally thread safe, this would rely on some form of synchronization, which has the potential for contention between threads. Either we'd have to use an RwLock or mutex internally around the fields of the cache struct, or we could use dashmap in place of the hashmap (but not the indexmap) to move the synchronization to the internal shards/chunks and reduce contention. But the locking necessary around the Decomposer2qCacheInner would cause a large amount of contention, because that's the core structure held by a thread for the most expensive operation in a thread: running the decomposition.

Another option would be a per-qargs cache: an outer dashmap instance keyed on the qargs array, whose values are the decomposer cache instances for those qargs. This would work fine, and while it would avoid the overhead of potentially computing the decomposers multiple times, it would use slightly more memory (assuming there are more qargs in the target than threads). There is still some synchronization if two threads try to compute a decomposition on the same qargs at the same time (or on qargs in the same shard). This would probably be a viable alternative (I tested it locally but didn't get accurate benchmarks before switching to thread-local storage). There is also the option of putting the state object behind a mutex, but this has the same issue as making the cache thread safe: the decomposer cache would effectively serialize the unitary synthesis calls, because the lock would be held by a thread for the duration of the decomposition.
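The chosen pattern translates roughly as follows (the real code is Rust `thread_local!` plus `RefCell`; this Python analogue only shows the structure): one cache per worker thread, reused across blocks, with no locking on the hot path, at the cost of possibly building the same decomposer in more than one thread.

```python
# Per-thread decomposer cache: no shared state, no locks, but caches
# are not shared between threads, so work can be duplicated.
import threading
from concurrent.futures import ThreadPoolExecutor

_tls = threading.local()

def _build_decomposer(qargs):
    # Stand-in for the expensive decomposer construction.
    return ("decomposer-for", qargs)

def decompose_block(qargs):
    cache = getattr(_tls, "cache", None)
    if cache is None:
        cache = _tls.cache = {}      # built once per thread, then reused
    if qargs not in cache:           # may be recomputed by another thread
        cache[qargs] = _build_decomposer(qargs)
    return cache[qargs]

with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(decompose_block, [(0, 1), (1, 2), (0, 1), (2, 3)]))
```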
This commit takes the single crates/synthesis/src/two_qubit_decompose.rs file and splits it into a directory containing several submodule files. The original file organically mirrored the Python structure, which was to have all the two qubit decomposer implementations in a single giant file. This was an anti-pattern in Python and has led to us missing many details over time, because it was impossible to keep the entire contents of the file in your head or on screen. This also translated to the Rust version of the module: as I recently discovered during the development of Qiskit#13419, there is a duplicated method with two different names on the two qubit basis decomposer. The implementations are actually right next to each other in the impl block, and nobody noticed because the file was too big. This commit doesn't fix that, as I wanted this migration to avoid changing any code if possible.

The new structure of this module breaks each decomposer implementation out into a standalone file. If we add new decomposers in the future they will naturally fit into new files. The only small changes made to the code in this commit are mostly about the visibility of functions and structs, because things need to be at least pub(super) if they cross a file boundary now. A few places were made fully pub or pub(crate) depending on how they're used; Python-facing things were made fully pub to indicate that, even though pyo3 doesn't require it. There was one small code change as part of this: the `TwoQubitBasisDecomposer::__getnewargs__` method was previously accessing the `TwoQubitWeylDecomposer`'s `unitary_matrix` field, converting it to a Python array, and finally to a Py<PyAny>. However, there was already a dedicated method for that, `TwoQubitWeylDecomposer::unitary_matrix()`, so I switched the implementation to just use that method instead. I originally did this to avoid changing the visibility of the unitary_matrix field; I ended up having to do that in other places anyway, but decided not to change this back because it's a more natural fit.
This commit adds an internal dedicated function to the UnitarySynthesis pass to enable other passes that synthesize unitaries to reuse its logic. We're looking at adding new passes that will do 2q synthesis to a target, specifically the two qubit peephole optimization pass being added in Qiskit#13419. The new interface in this PR lets that pass (and potential future passes) reuse the logic from unitary synthesis without the rest of the infrastructure. This was originally part of Qiskit#13419, but this PR opts to split it out since it's logically independent from the PR adding the new pass.
Co-authored-by: Alexander Ivrii <alexi@il.ibm.com>
This commit fixes an edge case in the scoring when there is only a single decomposition available. In UnitarySynthesis the fidelity calculation is not run as part of the synthesis in these cases, because there is no reason to compute the estimated score when it's not being compared to anything. However, for the new peephole pass we need to run the scoring, because the estimated fidelity is used to compare against the original circuit's block to determine whether to replace it. Previously, the new peephole pass treated the lack of a score as ideal, which was incorrect and led to the pass replacing every block unless the original circuit's block had more 2q gates. This commit fixes that by running the scoring manually when the synthesis doesn't return one.
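The shape of the fix can be sketched like this (the helper names are hypothetical, not Qiskit's API): a missing score must be computed, never assumed to be perfect.

```python
# Fallback scoring for the single-candidate case.
def estimate_error(circuit):
    # Stand-in for the real fidelity/error estimation over the block.
    return sum(gate["error"] for gate in circuit)

def pick_score(synthesized, reported_score=None):
    # The buggy behavior was equivalent to treating a missing score as
    # zero error, so every block looked like an improvement. Instead,
    # score the candidate manually when synthesis didn't.
    if reported_score is None:
        reported_score = estimate_error(synthesized)
    return reported_score

block = [{"error": 0.01}, {"error": 0.02}]
score = pick_score(block)  # scored manually rather than assumed ideal
```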
Summary
This commit adds a new transpiler pass for physical optimization,
TwoQubitPeepholeOptimization. This replaces the use of Collect2qBlocks,
ConsolidateBlocks, and UnitarySynthesis in the optimization stage for
a default pass manager setup. The pass logically works the same way
where it analyzes the dag to get a list of 2q runs, calculates the matrix
of each run, and then synthesizes the matrix and substitutes it in place.
The distinction is that this pass does all of this in a single
pass, and also parallelizes the matrix calculation and synthesis steps
because there is no data dependency there.
This new pass is not meant to fully replace the Collect2qBlocks,
ConsolidateBlocks, or UnitarySynthesis passes as those also run in
contexts where we don't have a physical circuit. This is meant instead
to replace their usage in the optimization stage only. Accordingly this
new pass also changes the logic on how we select the synthesis to use
and when to make a substitution. Previously this logic was primarily done
via the ConsolidateBlocks pass by only consolidating to a UnitaryGate if
the number of basis gates needed based on the Weyl chamber coordinates
was less than the number of 2q gates in the block (see #11659 for
discussion on this). Since this new pass skips the explicit
consolidation stage, we instead try all the available synthesizers.
Right now this commit has a number of limitations, the largest being that only
Qiskit's built-in two qubit synthesis routines written in Rust are used
(TwoQubitBasisDecomposer and TwoQubitControlledUDecomposer). This pass doesn't
support using the unitary synthesis plugin interface, since it's optimized to
use those built-in routines. The existing combination of ConsolidateBlocks and
UnitarySynthesis should be used instead if the plugin interface is necessary.
Details and comments
Fixes #12007
Fixes #11659
TODO: