Skip to content

Add new multithreaded TwoQubitPeepholeOptimization pass#13419

Open
mtreinish wants to merge 67 commits intoQiskit:mainfrom
mtreinish:two-qubit-peephole-parallel-pass
Open

Add new multithreaded TwoQubitPeepholeOptimization pass#13419
mtreinish wants to merge 67 commits intoQiskit:mainfrom
mtreinish:two-qubit-peephole-parallel-pass

Conversation

@mtreinish
Copy link
Member

@mtreinish mtreinish commented Nov 10, 2024

Summary

This commit adds a new transpiler pass for physical optimization,
TwoQubitPeepholeOptimization. This replaces the use of Collect2qBlocks,
ConsolidateBlocks, and UnitarySynthesis in the optimization stage for
a default pass manager setup. The pass logically works the same way
where it analyzes the dag to get a list of 2q runs, calculates the matrix
of each run, and then synthesizes the matrix and substitutes it inplace.
The distinction this pass makes though is it does this all in a single
pass and also parallelizes the matrix calculation and synthesis steps
because there is no data dependency there.

This new pass is not meant to fully replace the Collect2qBlocks,
ConsolidateBlocks, or UnitarySynthesis passes as those also run in
contexts where we don't have a physical circuit. This is meant instead
to replace their usage in the optimization stage only. Accordingly this
new pass also changes the logic on how we select the synthesis to use
and when to make a substitution. Previously this logic was primarily done
via the ConsolidateBlocks pass by only consolidating to a UnitaryGate if
the number of basis gates needed based on the weyl chamber coordinates
was less than the number of 2q gates in the block (see #11659 for
discussion on this). Since this new pass skips the explicit
consolidation stage we go ahead and try all the available synthesizers

Right now this commit has a number of limitations, the largest are:

  • Only supports the target
  • It doesn't support the XX decomposer because it's not in rust (the TwoQubitBasisDecomposer and TwoQubitControlledUDecomposer are used)

This pass doesn't support using the unitary synthesis plugin interface, since
it's optimized to use Qiskit's built-in two qubit synthesis routines written in
Rust. The existing combination of ConsolidateBlocks and UnitarySynthesis
should be used instead if the plugin interface is necessary.

Details and comments

Fixes #12007
Fixes #11659

TODO:

@mtreinish mtreinish added performance Changelog: Added Add an "Added" entry in the GitHub Release changelog. Rust This PR or issue is related to Rust code in the repository mod: transpiler Issues and PRs related to Transpiler labels Nov 10, 2024
@mtreinish mtreinish added this to the 2.0.0 milestone Nov 10, 2024
@coveralls
Copy link

coveralls commented Nov 10, 2024

Pull Request Test Coverage Report for Build 15654439601

Details

  • 576 of 611 (94.27%) changed or added relevant lines in 12 files are covered.
  • 22 unchanged lines in 4 files lost coverage.
  • Overall coverage increased (+0.02%) to 88.02%

Changes Missing Coverage Covered Lines Changed/Added Lines %
crates/synthesis/src/two_qubit_decompose.rs 16 17 94.12%
crates/transpiler/src/passes/unitary_synthesis.rs 12 13 92.31%
crates/transpiler/src/passes/two_qubit_unitary_synthesis_utils.rs 133 139 95.68%
crates/transpiler/src/passes/two_qubit_peephole.rs 376 403 93.3%
Files with Coverage Reduction New Missed Lines %
crates/qasm2/src/lex.rs 2 92.48%
crates/transpiler/src/passes/unitary_synthesis.rs 2 94.33%
crates/circuit/src/symbol_expr.rs 6 73.69%
crates/qasm2/src/parse.rs 12 96.68%
Totals Coverage Status
Change from base Build 15635869582: 0.02%
Covered Lines: 83570
Relevant Lines: 94944

💛 - Coveralls

This commit adds a new transpiler pass for physical optimization,
TwoQubitPeepholeOptimization. This replaces the use of Collect2qBlocks,
ConsolidateBlocks, and UnitarySynthesis in the optimization stage for
a default pass manager setup. The pass logically works the same way
where it analyzes the dag to get a list of 2q runs, calculates the matrix
of each run, and then synthesizes the matrix and substitutes it inplace.
The distinction this pass makes though is it does this all in a single
pass and also parallelizes the matrix calculation and synthesis steps
because there is no data dependency there.

This new pass is not meant to fully replace the Collect2qBlocks,
ConsolidateBlocks, or UnitarySynthesis passes as those also run in
contexts where we don't have a physical circuit. This is meant instead
to replace their usage in the optimization stage only. Accordingly this
new pass also changes the logic on how we select the synthesis to use
and when to make a substituion. Previously this logic was primarily done
via the ConsolidateBlocks pass by only consolidating to a UnitaryGate if
the number of basis gates needed based on the weyl chamber coordinates
was less than the number of 2q gates in the block (see Qiskit#11659 for
discussion on this). Since this new pass skips the explicit
consolidation stage we go ahead and try all the available synthesizers

Right now this commit has a number of limitations, the largest are:

- Only supports the target
- It doesn't support any synthesizers besides the TwoQubitBasisDecomposer,
  because it's the only one in rust currently.

For plugin handling I left the logic as running the three pass series,
but I'm not sure this is the behavior we want. We could say keep the
synthesis plugins for `UnitarySynthesis` only and then rely on our
built-in methods for physical optimiztion only. But this also seems less
than ideal because the plugin mechanism is how we support synthesizing
to custom basis gates, and also more advanced approximate synthesis
methods. Both of those are things we need to do as part of the synthesis
here.

Additionally, this is currently missing tests and documentation and while
running it manually "works" as in it returns a circuit that looks valid,
I've not done any validation yet. This also likely will need several
rounds of performance optimization and tuning. t this point this is
just a rough proof of concept and will need a lof refinement along with
larger changes to Qiskit's rust code before this is ready to merge.

Fixes Qiskit#12007
Fixes Qiskit#11659
Since Qiskit#13139 merged we have another two qubit decomposer available to
run in rust, the TwoQubitControlledUDecomposer. This commit updates the
new TwoQubitPeepholeOptimization to call this decomposer if the target
supports appropriate 2q gates.
Clippy is correctly warning that the size difference between the two
decomposer types in the TwoQubitDecomposer enumese two types is large.
TwoQubitBasisDecomposer is 1640 bytes and TwoQubitControlledUDecomposer
is only 24 bytes. This means each element of ControlledU is wasting
> 1600 bytes. However, in this case that is acceptable in order to
avoid a layer of pointer indirection as these are stored temporarily
in a vec inside a thread to decompose a unitary. A trait would be more
natural for this to define a common interface between all the two qubit
decomposers but since we keep them instantiated for each edge in a Vec
they need to be sized and doing something like
`Box<dyn TwoQubitDecomposer>` (assuming a trait `TwoQubitDecomposer`
instead of a enum) to get around this would have additional runtime
overhead. This is also considering that TwoQubitControlledUDecomposer
has far less likelihood in practice as it only works with some targets
that have RZZ, RXX, RYY, or RZX gates on an edge which is less common.
Also don't run scoring more than needed.
@ShellyGarion
Copy link
Member

Copy here the comment of @t-imamichi #13568 (comment)
and my reply: #13568 (comment)

I think this closes #13428. How about adding a test case of consecutive RZZ (RXX, and RYY) gates?

We should make sure that after PR #13568 and this PR will be merged, we can efficiently transpile circuits into basis fractional RZZ gates .

@mtreinish
Copy link
Member Author

I added support for using the ControlledUDecomposer to the new pass back in early December with this commit: 746758f although looking at that now with fresh eyes I need to check that the gate is continuous in the target, right now it only looks at the supported gate types.

@ShellyGarion ShellyGarion self-assigned this Feb 3, 2025
@mtreinish mtreinish removed the on hold Can not fix yet label Mar 17, 2026
@mtreinish mtreinish removed this from Qiskit 2.4 Mar 17, 2026
@mtreinish mtreinish modified the milestones: 2.4.0, 2.5.0 Mar 17, 2026
@mtreinish mtreinish moved this from Ready to In development in Qiskit 2.5 Mar 17, 2026
Prior to this commit there was a bug in the dag reconstruction logic for
specific subcircuit formations that could cause the synthesized block to
be inserted before it's parent nodes in the dag. Specfically if you
think of a circuit such as as:

0: h(1)
1: cx(3, 2)
2: cx(1, 2)
3: cx(2, 1)

(not an exact example but showcases it well). Nodes 0, 2, and 3 form a
block that can be optimized through peephole resynthesis. The previous
logic for reconstruction could encounter node 0 first when iterating
over the nodes in topological order and then would proceed to insert
the entire contents of the synthesized block before node 1. This would
be invalid because node 1 is a parent of 2 and 3. This commit fixes this
by holding the block insertion until the first 2q node instead of the
first node.
The score needs to be less to be improved, before we were comparing
fidelities as the second field, less fidelity is worse and we shouldn't
prefer that. This switches to use error for the second field in the
score comparison to ensure that less is better.
@mtreinish
Copy link
Member Author

Is this PR ready for 2.4?

No, it's much closer than before but still not ready so it'll have to be for 2.5. It seems to be fully functional now but I still need to do benchmarking and performance tuning to do. I took some shortcuts that aren't optimal to get a working baseline as a starting point that I need to fix. I also want to do some small code cleanup after all that.

How does it handle gates with cost 0 like RZ? does it reduce their count?

The heuristic used for when to optimize a block has 3 things it looks at:

  1. it looks at the total 2q gate count in the block and uses the lowest.
  2. if the 2q count is the same it looks at the error estimate (based on the error rate in the target). If rz has 0 error it would not contribute to this metric.
  3. if the error estimates are the same then the fallback is to pick the result that has the fewest total gates.

This is used both to pick the best synthesis from all the available decomposers and also used to decide whether to replace the block with the best synthesis. If all 3 are a tie then it also defaults to not replacing the block

This commit moves away from using a unique UnitarySynthesisState
instance for each block run in the parallel iterator to using a thread
local cache. This means we'll have n caches that are reused per thread.
The downside with this is potentially re-computating the same set of
decomposers between threads.The other tradeoff is this requires a
refcell for dynamic borrow checking as the threadlocal storage only
returns `&` references and not mutables ones. The dynamic borrow
checking will have a runtime overhead to access the cache.  In practice
the overhead of this approach should be minimal especially compared t
some of the other options and this approach speeds the pass up about 7%.

As the cache struct used in the state object is not thread safe as at
it's core it's based around a IndexMap to map the qubits to the
decomposers to use. While we could look at refactoring the struct to
be internally threadsafe this would rely on some form of
synchronization which has the potential for contention between
threads. Either we have to internally use a RWLock
or mutex around the internal fields on the cache struct or we could look
at using dashmap in place of hashmap (but not the indexmap) to move the
synchronization to the shards/chunks internally to reduce contention.
But the locking necessary around the Decomposer2qCacheInner would cause
a large amount of contention because that's the core structure that will
be held by a thread for the most expensive operation in a thread,
running the decomposition.

Another, option would be to use a per qarg cache so we have
an outer dashmap instance that is keyed on the qargs array and the value
is the decomposer cache instance for that qarg. This would work fine and
would avoid the overhead of having to potentially compute the decomposers
multiple times it would use slightly more memory (assuming there are
more qargs in the target than threads). There is still some
synchronization if two threads are trying to compute decomposition on
the same qargs at the same time (or if qargs on the same shard). This
would probably be a viable alternative (I tested it locally but didn't
get accurate benchmarks before switching to thread local storage).

There is also the option to put the state object behind a mutex but this
has the same issues as making the cache threadsafe the decomposer cache
will effectively serialize the unitary synthesis calls because the lock
will be held by a thread for the duration of the decomposition.
mtreinish added a commit to mtreinish/qiskit-core that referenced this pull request Mar 18, 2026
This commit takes the single crates/synthesis/src/two_qubit_decompose.rs
file and splits it into a directory that contains several submodule
files. The original file kind of organically mirrored the python
structure which was to have all the two qubit decomposer implementations
in a single giant file. This was an anti-pattern in python and has led
to us missing many details over time because it was impossible to keep
the entire contents of the file in your head or on screen. This has also
translated to the Rust version of this module as I recently discovered
during the development of Qiskit#13419 that there is a duplicated method with
two different names on the two qubit basis decomposer. The
implementations actually are right next to each other in the impl block.
Nobody noticed because the file was too big. This commit doesn't fix
this as I wanted this migration to not change any code if possible. The
new structure of this module is to break out each decomposer
implementation into a standalone file. If we add new decomposers in the
future they will naturally fit into a new file.

The only small changes made to the code in this commit are mostly about
visibility of functions and structs because things need to be at least
pub(super) if they cross a file boundary now. A few places were made
fully pub or pub(crate) depending on how they're used. Python things
were made full pub to indicate that even if pyo3 doesn't require it.
There was one small code change made as part of this in the
`TwoQubitBasisDecomposer::__getnewargs__` method it was previously
accessing the `TwoQubitWeylDecomposer`'s `unitary_matrix` field and then
converting that to a python array and finally a Py<PyAny>. However there
was a dedicated method for that already
`TwoQubitWeylDecomposer::unitary_matrix()`, I switched the
implementation to just use that method instead. I originally did this in
order to avoid changing the visibility on the unitary_matrix field, but
I had to do that for other places, but I decided not to change this back
because it's a more natural fit.
mtreinish added a commit to mtreinish/qiskit-core that referenced this pull request Mar 20, 2026
This commit adds an internal dedicated function to the UnitarySynthesis
function to enable other passes that are synthesizes unitaries to reuse
the logic from UnitarySynthesis. We're looking at adding new passes that
will do 2q synthesized to a target, specifically the two qubit peephole
optimization pass being added in Qiskit#13419. The new interface in this PR
lets that pass (and potential future passes) reuse the logic from
unitary synthesis without the rest of the infrastructure. This was
originally part of Qiskit#13419 but this opts to split it out since it's
logically independent from the PR adding the new pass.
github-merge-queue bot pushed a commit that referenced this pull request Mar 22, 2026
This commit takes the single crates/synthesis/src/two_qubit_decompose.rs
file and splits it into a directory that contains several submodule
files. The original file kind of organically mirrored the python
structure which was to have all the two qubit decomposer implementations
in a single giant file. This was an anti-pattern in python and has led
to us missing many details over time because it was impossible to keep
the entire contents of the file in your head or on screen. This has also
translated to the Rust version of this module as I recently discovered
during the development of #13419 that there is a duplicated method with
two different names on the two qubit basis decomposer. The
implementations actually are right next to each other in the impl block.
Nobody noticed because the file was too big. This commit doesn't fix
this as I wanted this migration to not change any code if possible. The
new structure of this module is to break out each decomposer
implementation into a standalone file. If we add new decomposers in the
future they will naturally fit into a new file.

The only small changes made to the code in this commit are mostly about
visibility of functions and structs because things need to be at least
pub(super) if they cross a file boundary now. A few places were made
fully pub or pub(crate) depending on how they're used. Python things
were made full pub to indicate that even if pyo3 doesn't require it.
There was one small code change made as part of this in the
`TwoQubitBasisDecomposer::__getnewargs__` method it was previously
accessing the `TwoQubitWeylDecomposer`'s `unitary_matrix` field and then
converting that to a python array and finally a Py<PyAny>. However there
was a dedicated method for that already
`TwoQubitWeylDecomposer::unitary_matrix()`, I switched the
implementation to just use that method instead. I originally did this in
order to avoid changing the visibility on the unitary_matrix field, but
I had to do that for other places, but I decided not to change this back
because it's a more natural fit.

Co-authored-by: Alexander Ivrii <alexi@il.ibm.com>
This commit fixes an edge case in the scoring when there was only a
single decomposition available. In UnitarySynthesis for these cases
the fidelity calculation is not run as part of the synthesis because there
is no reason to compute the estimated score when it's not being compared
to anything. However, for the new peephole pass we need to run the
scoring because the estimated fidelity is used to compare against the
original circuit's block to determine whether to replace it or not.
Previously, the new peephole pass treated the lack of a score as it
being ideal, which was incorrect and led to the pass replacing every
block unless the original circuit's block had more 2q gates. This fixes
this and now runs the scoring manually if there is not one returned by
the synthesis.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Changelog: Added Add an "Added" entry in the GitHub Release changelog. mod: transpiler Issues and PRs related to Transpiler performance Rust This PR or issue is related to Rust code in the repository

Projects

Status: In development

Development

Successfully merging this pull request may close these issues.

Add parallel synthesis interface to default unitary synthesis plugin ConsolidateBlocks does not have a good logic for heterogeneous gates

9 participants