Add new multithreaded TwoQubitPeepholeOptimization pass by mtreinish · Pull Request #13419 · Qiskit/qiskit

mtreinish · 2024-11-10T16:12:54Z

Summary

This commit adds a new transpiler pass for physical optimization,
TwoQubitPeepholeOptimization. This replaces the use of Collect2qBlocks,
ConsolidateBlocks, and UnitarySynthesis in the optimization stage for
a default pass manager setup. The pass logically works the same way
where it analyzes the dag to get a list of 2q runs, calculates the matrix
of each run, and then synthesizes the matrix and substitutes it inplace.
The distinction this pass makes though is it does this all in a single
pass and also parallelizes the matrix calculation and synthesis steps
because there is no data dependency there.

This new pass is not meant to fully replace the Collect2qBlocks,
ConsolidateBlocks, or UnitarySynthesis passes as those also run in
contexts where we don't have a physical circuit. This is meant instead
to replace their usage in the optimization stage only. Accordingly this
new pass also changes the logic on how we select the synthesis to use
and when to make a substitution. Previously this logic was primarily done
via the ConsolidateBlocks pass by only consolidating to a UnitaryGate if
the number of basis gates needed based on the weyl chamber coordinates
was less than the number of 2q gates in the block (see #11659 for
discussion on this). Since this new pass skips the explicit
consolidation stage we go ahead and try all the available synthesizers

Right now this commit has a number of limitations, the largest are:

Only supports the target
It doesn't support the XX decomposer because it's not in rust (the TwoQubitBasisDecomposer and TwoQubitControlledUDecomposer are used)

This pass doesn't support using the unitary synthesis plugin interface, since
it's optimized to use Qiskit's built-in two qubit synthesis routines written in
Rust. The existing combination of ConsolidateBlocks and UnitarySynthesis
should be used instead if the plugin interface is necessary.

Details and comments

Fixes #12007
Fixes #11659

TODO:

Rebase after Use OnceLock instead of OnceCell #13410 merges
Add tests
Add documentation
Benchmarking and performance tuning
Handle running serially when in multiprocessing context
Release note

coveralls · 2024-11-10T16:37:45Z

Pull Request Test Coverage Report for Build 15654439601

Details

576 of 611 (94.27%) changed or added relevant lines in 12 files are covered.
22 unchanged lines in 4 files lost coverage.
Overall coverage increased (+0.02%) to 88.02%

Changes Missing Coverage	Covered Lines	Changed/Added Lines	%
crates/synthesis/src/two_qubit_decompose.rs	16	17	94.12%
crates/transpiler/src/passes/unitary_synthesis.rs	12	13	92.31%
crates/transpiler/src/passes/two_qubit_unitary_synthesis_utils.rs	133	139	95.68%
crates/transpiler/src/passes/two_qubit_peephole.rs	376	403	93.3%

Files with Coverage Reduction	New Missed Lines	%
crates/qasm2/src/lex.rs	2	92.48%
crates/transpiler/src/passes/unitary_synthesis.rs	2	94.33%
crates/circuit/src/symbol_expr.rs	6	73.69%
crates/qasm2/src/parse.rs	12	96.68%

Totals
Change from base Build 15635869582:	0.02%
Covered Lines:	83570
Relevant Lines:	94944

💛 - Coveralls

This commit adds a new transpiler pass for physical optimization, TwoQubitPeepholeOptimization. This replaces the use of Collect2qBlocks, ConsolidateBlocks, and UnitarySynthesis in the optimization stage for a default pass manager setup. The pass logically works the same way where it analyzes the dag to get a list of 2q runs, calculates the matrix of each run, and then synthesizes the matrix and substitutes it inplace. The distinction this pass makes though is it does this all in a single pass and also parallelizes the matrix calculation and synthesis steps because there is no data dependency there. This new pass is not meant to fully replace the Collect2qBlocks, ConsolidateBlocks, or UnitarySynthesis passes as those also run in contexts where we don't have a physical circuit. This is meant instead to replace their usage in the optimization stage only. Accordingly this new pass also changes the logic on how we select the synthesis to use and when to make a substituion. Previously this logic was primarily done via the ConsolidateBlocks pass by only consolidating to a UnitaryGate if the number of basis gates needed based on the weyl chamber coordinates was less than the number of 2q gates in the block (see Qiskit#11659 for discussion on this). Since this new pass skips the explicit consolidation stage we go ahead and try all the available synthesizers Right now this commit has a number of limitations, the largest are: - Only supports the target - It doesn't support any synthesizers besides the TwoQubitBasisDecomposer, because it's the only one in rust currently. For plugin handling I left the logic as running the three pass series, but I'm not sure this is the behavior we want. We could say keep the synthesis plugins for `UnitarySynthesis` only and then rely on our built-in methods for physical optimiztion only. But this also seems less than ideal because the plugin mechanism is how we support synthesizing to custom basis gates, and also more advanced approximate synthesis methods. Both of those are things we need to do as part of the synthesis here. Additionally, this is currently missing tests and documentation and while running it manually "works" as in it returns a circuit that looks valid, I've not done any validation yet. This also likely will need several rounds of performance optimization and tuning. t this point this is just a rough proof of concept and will need a lof refinement along with larger changes to Qiskit's rust code before this is ready to merge. Fixes Qiskit#12007 Fixes Qiskit#11659

…rallel-pass

Since Qiskit#13139 merged we have another two qubit decomposer available to run in rust, the TwoQubitControlledUDecomposer. This commit updates the new TwoQubitPeepholeOptimization to call this decomposer if the target supports appropriate 2q gates.

Clippy is correctly warning that the size difference between the two decomposer types in the TwoQubitDecomposer enumese two types is large. TwoQubitBasisDecomposer is 1640 bytes and TwoQubitControlledUDecomposer is only 24 bytes. This means each element of ControlledU is wasting > 1600 bytes. However, in this case that is acceptable in order to avoid a layer of pointer indirection as these are stored temporarily in a vec inside a thread to decompose a unitary. A trait would be more natural for this to define a common interface between all the two qubit decomposers but since we keep them instantiated for each edge in a Vec they need to be sized and doing something like `Box<dyn TwoQubitDecomposer>` (assuming a trait `TwoQubitDecomposer` instead of a enum) to get around this would have additional runtime overhead. This is also considering that TwoQubitControlledUDecomposer has far less likelihood in practice as it only works with some targets that have RZZ, RXX, RYY, or RZX gates on an edge which is less common.

…rallel-pass

Also don't run scoring more than needed.

ShellyGarion · 2025-01-23T07:02:09Z

Copy here the comment of @t-imamichi #13568 (comment)
and my reply: #13568 (comment)

I think this closes #13428. How about adding a test case of consecutive RZZ (RXX, and RYY) gates?

We should make sure that after PR #13568 and this PR will be merged, we can efficiently transpile circuits into basis fractional RZZ gates .

…rallel-pass

mtreinish · 2025-01-26T14:09:40Z

I added support for using the ControlledUDecomposer to the new pass back in early December with this commit: 746758f although looking at that now with fresh eyes I need to check that the gate is continuous in the target, right now it only looks at the supported gate types.

…tinuous

Prior to this commit there was a bug in the dag reconstruction logic for specific subcircuit formations that could cause the synthesized block to be inserted before it's parent nodes in the dag. Specfically if you think of a circuit such as as: 0: h(1) 1: cx(3, 2) 2: cx(1, 2) 3: cx(2, 1) (not an exact example but showcases it well). Nodes 0, 2, and 3 form a block that can be optimized through peephole resynthesis. The previous logic for reconstruction could encounter node 0 first when iterating over the nodes in topological order and then would proceed to insert the entire contents of the synthesized block before node 1. This would be invalid because node 1 is a parent of 2 and 3. This commit fixes this by holding the block insertion until the first 2q node instead of the first node.

…rallel-pass

The score needs to be less to be improved, before we were comparing fidelities as the second field, less fidelity is worse and we shouldn't prefer that. This switches to use error for the second field in the score comparison to ensure that less is better.

…ze>>

mtreinish · 2026-03-18T02:42:48Z

Is this PR ready for 2.4?

No, it's much closer than before but still not ready so it'll have to be for 2.5. It seems to be fully functional now but I still need to do benchmarking and performance tuning to do. I took some shortcuts that aren't optimal to get a working baseline as a starting point that I need to fix. I also want to do some small code cleanup after all that.

How does it handle gates with cost 0 like RZ? does it reduce their count?

The heuristic used for when to optimize a block has 3 things it looks at:

it looks at the total 2q gate count in the block and uses the lowest.
if the 2q count is the same it looks at the error estimate (based on the error rate in the target). If rz has 0 error it would not contribute to this metric.
if the error estimates are the same then the fallback is to pick the result that has the fewest total gates.

This is used both to pick the best synthesis from all the available decomposers and also used to decide whether to replace the block with the best synthesis. If all 3 are a tie then it also defaults to not replacing the block

This commit moves away from using a unique UnitarySynthesisState instance for each block run in the parallel iterator to using a thread local cache. This means we'll have n caches that are reused per thread. The downside with this is potentially re-computating the same set of decomposers between threads.The other tradeoff is this requires a refcell for dynamic borrow checking as the threadlocal storage only returns `&` references and not mutables ones. The dynamic borrow checking will have a runtime overhead to access the cache. In practice the overhead of this approach should be minimal especially compared t some of the other options and this approach speeds the pass up about 7%. As the cache struct used in the state object is not thread safe as at it's core it's based around a IndexMap to map the qubits to the decomposers to use. While we could look at refactoring the struct to be internally threadsafe this would rely on some form of synchronization which has the potential for contention between threads. Either we have to internally use a RWLock or mutex around the internal fields on the cache struct or we could look at using dashmap in place of hashmap (but not the indexmap) to move the synchronization to the shards/chunks internally to reduce contention. But the locking necessary around the Decomposer2qCacheInner would cause a large amount of contention because that's the core structure that will be held by a thread for the most expensive operation in a thread, running the decomposition. Another, option would be to use a per qarg cache so we have an outer dashmap instance that is keyed on the qargs array and the value is the decomposer cache instance for that qarg. This would work fine and would avoid the overhead of having to potentially compute the decomposers multiple times it would use slightly more memory (assuming there are more qargs in the target than threads). There is still some synchronization if two threads are trying to compute decomposition on the same qargs at the same time (or if qargs on the same shard). This would probably be a viable alternative (I tested it locally but didn't get accurate benchmarks before switching to thread local storage). There is also the option to put the state object behind a mutex but this has the same issues as making the cache threadsafe the decomposer cache will effectively serialize the unitary synthesis calls because the lock will be held by a thread for the duration of the decomposition.

…rallel-pass

This commit takes the single crates/synthesis/src/two_qubit_decompose.rs file and splits it into a directory that contains several submodule files. The original file kind of organically mirrored the python structure which was to have all the two qubit decomposer implementations in a single giant file. This was an anti-pattern in python and has led to us missing many details over time because it was impossible to keep the entire contents of the file in your head or on screen. This has also translated to the Rust version of this module as I recently discovered during the development of Qiskit#13419 that there is a duplicated method with two different names on the two qubit basis decomposer. The implementations actually are right next to each other in the impl block. Nobody noticed because the file was too big. This commit doesn't fix this as I wanted this migration to not change any code if possible. The new structure of this module is to break out each decomposer implementation into a standalone file. If we add new decomposers in the future they will naturally fit into a new file. The only small changes made to the code in this commit are mostly about visibility of functions and structs because things need to be at least pub(super) if they cross a file boundary now. A few places were made fully pub or pub(crate) depending on how they're used. Python things were made full pub to indicate that even if pyo3 doesn't require it. There was one small code change made as part of this in the `TwoQubitBasisDecomposer::__getnewargs__` method it was previously accessing the `TwoQubitWeylDecomposer`'s `unitary_matrix` field and then converting that to a python array and finally a Py<PyAny>. However there was a dedicated method for that already `TwoQubitWeylDecomposer::unitary_matrix()`, I switched the implementation to just use that method instead. I originally did this in order to avoid changing the visibility on the unitary_matrix field, but I had to do that for other places, but I decided not to change this back because it's a more natural fit.

…rallel-pass

…instead of panic

This commit adds an internal dedicated function to the UnitarySynthesis function to enable other passes that are synthesizes unitaries to reuse the logic from UnitarySynthesis. We're looking at adding new passes that will do 2q synthesized to a target, specifically the two qubit peephole optimization pass being added in Qiskit#13419. The new interface in this PR lets that pass (and potential future passes) reuse the logic from unitary synthesis without the rest of the infrastructure. This was originally part of Qiskit#13419 but this opts to split it out since it's logically independent from the PR adding the new pass.

This commit takes the single crates/synthesis/src/two_qubit_decompose.rs file and splits it into a directory that contains several submodule files. The original file kind of organically mirrored the python structure which was to have all the two qubit decomposer implementations in a single giant file. This was an anti-pattern in python and has led to us missing many details over time because it was impossible to keep the entire contents of the file in your head or on screen. This has also translated to the Rust version of this module as I recently discovered during the development of #13419 that there is a duplicated method with two different names on the two qubit basis decomposer. The implementations actually are right next to each other in the impl block. Nobody noticed because the file was too big. This commit doesn't fix this as I wanted this migration to not change any code if possible. The new structure of this module is to break out each decomposer implementation into a standalone file. If we add new decomposers in the future they will naturally fit into a new file. The only small changes made to the code in this commit are mostly about visibility of functions and structs because things need to be at least pub(super) if they cross a file boundary now. A few places were made fully pub or pub(crate) depending on how they're used. Python things were made full pub to indicate that even if pyo3 doesn't require it. There was one small code change made as part of this in the `TwoQubitBasisDecomposer::__getnewargs__` method it was previously accessing the `TwoQubitWeylDecomposer`'s `unitary_matrix` field and then converting that to a python array and finally a Py<PyAny>. However there was a dedicated method for that already `TwoQubitWeylDecomposer::unitary_matrix()`, I switched the implementation to just use that method instead. I originally did this in order to avoid changing the visibility on the unitary_matrix field, but I had to do that for other places, but I decided not to change this back because it's a more natural fit. Co-authored-by: Alexander Ivrii <alexi@il.ibm.com>

…rallel-pass

This commit fixes an edge case in the scoring when there was only a single decomposition available. In UnitarySynthesis for these cases the fidelity calculation is not run as part of the synthesis because there is no reason to compute the estimated score when it's not being compared to anything. However, for the new peephole pass we need to run the scoring because the estimated fidelity is used to compare against the original circuit's block to determine whether to replace it or not. Previously, the new peephole pass treated the lack of a score as it being ideal, which was incorrect and led to the pass replacing every block unless the original circuit's block had more 2q gates. This fixes this and now runs the scoring manually if there is not one returned by the synthesis.

mtreinish added performance Changelog: Added Add an "Added" entry in the GitHub Release changelog. Rust This PR or issue is related to Rust code in the repository mod: transpiler Issues and PRs related to Transpiler labels Nov 10, 2024

mtreinish added this to the 2.0.0 milestone Nov 10, 2024

This was referenced Nov 11, 2024

Use OnceLock instead of OnceCell #13410

Merged

Suboptimal optimization result of consecutive RZZ gates with optimization level 2 and 3 #13431

Open

mtreinish force-pushed the two-qubit-peephole-parallel-pass branch from ad06d1a to 4d160bc Compare November 14, 2024 12:34

mtreinish mentioned this pull request Nov 18, 2024

Fix post-oxidization change in ConsolidateBlocks behavior #13450

Merged

mtreinish added 8 commits December 1, 2024 02:05

Merge remote-tracking branch 'origin/main' into two-qubit-peephole-pa…

decee9a

…rallel-pass

Merge remote-tracking branch 'origin/main' into two-qubit-peephole-pa…

4d4df68

…rallel-pass

Embed 2q gate count into score as tie breaker

cb6b70f

Also don't run scoring more than needed.

Release GIL during parallel portion

f06a070

Merge branch 'main' into two-qubit-peephole-parallel-pass

90b16e8

Fix lint

a175ee8

ShellyGarion mentioned this pull request Jan 23, 2025

Add 2q fractional gates to the UnitarySynthesis transpiler pass #13568

Merged

Merge remote-tracking branch 'origin/main' into two-qubit-peephole-pa…

af0c144

…rallel-pass

mtreinish added 4 commits January 26, 2025 09:28

Update ControlledUDecomposer to ensure we only run if the gate is con…

79a46c5

…tinuous

Add reversed synthesis for two qubit basis decomposer

839b4c9

Fix handling of single direction gates

d9399a6

Fix import cycle

b4c4360

ShellyGarion self-assigned this Feb 3, 2025

Merge branch 'main' into two-qubit-peephole-parallel-pass

aefdc90

1ucian0 assigned mtreinish Feb 6, 2025

mtreinish removed the on hold Can not fix yet label Mar 17, 2026

mtreinish removed this from Qiskit 2.4 Mar 17, 2026

mtreinish added this to Qiskit 2.5 Mar 17, 2026

github-project-automation bot moved this to Ready in Qiskit 2.5 Mar 17, 2026

mtreinish modified the milestones: 2.4.0, 2.5.0 Mar 17, 2026

mtreinish moved this from Ready to In development in Qiskit 2.5 Mar 17, 2026

mtreinish added 6 commits March 17, 2026 12:30

Remove smallvec usage for original sequence scoring

79f2466

Fix original score creation

1cdaa61

Merge remote-tracking branch 'origin/main' into two-qubit-peephole-pa…

4503c14

…rallel-pass

Use Dashmap<NodeIndex, usize> instead of Mutex<HashMap<NodeIndex, usi…

8bcc573

…ze>>

mtreinish added 5 commits March 18, 2026 00:18

Fix doc indent

76b40ab

Appease the lint check changing http->https in the license header

76f5935

Fix release note

2d40a18

Merge remote-tracking branch 'origin/main' into two-qubit-peephole-pa…

b88de4a

…rallel-pass

mtreinish mentioned this pull request Mar 18, 2026

Split two qubit decomposer file into separate files #15833

Merged

mtreinish added 2 commits March 19, 2026 12:31

Merge remote-tracking branch 'origin/main' into two-qubit-peephole-pa…

ce3dddf

…rallel-pass

Return a user error from a decomposer build with an incorrect string …

158725f

…instead of panic

mtreinish mentioned this pull request Mar 20, 2026

Add internal direct 2q synthesis interface to UnitarySynthesis #15845

Open

mtreinish added 2 commits March 23, 2026 09:41

Merge remote-tracking branch 'origin/main' into two-qubit-peephole-pa…

a8a3fe7

…rallel-pass

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add new multithreaded TwoQubitPeepholeOptimization pass#13419

Add new multithreaded TwoQubitPeepholeOptimization pass#13419
mtreinish wants to merge 67 commits intoQiskit:mainfrom
mtreinish:two-qubit-peephole-parallel-pass

mtreinish commented Nov 10, 2024 •

edited

Loading

Uh oh!

coveralls commented Nov 10, 2024 •

edited

Loading

Uh oh!

ShellyGarion commented Jan 23, 2025

Uh oh!

mtreinish commented Jan 26, 2025

Uh oh!

mtreinish commented Mar 18, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

9 participants

Conversation

mtreinish commented Nov 10, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Details and comments

Uh oh!

coveralls commented Nov 10, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Pull Request Test Coverage Report for Build 15654439601

Details

💛 - Coveralls

Uh oh!

ShellyGarion commented Jan 23, 2025

Uh oh!

mtreinish commented Jan 26, 2025

Uh oh!

mtreinish commented Mar 18, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

9 participants

mtreinish commented Nov 10, 2024 •

edited

Loading

coveralls commented Nov 10, 2024 •

edited

Loading