Skip to content

Fast reduction mod poly for TraceMod.#101

Open
gmaxwell wants to merge 1 commit into
bitcoin-core:masterfrom
gmaxwell:fast_tracemod_reducers
Open

Fast reduction mod poly for TraceMod.#101
gmaxwell wants to merge 1 commit into
bitcoin-core:masterfrom
gmaxwell:fast_tracemod_reducers

Conversation

@gmaxwell
Copy link
Copy Markdown
Contributor

@gmaxwell gmaxwell commented May 21, 2026

While working with minisketch for an amateur radio application, I came up with a performance improvement:

TraceMod is the slowest part of recovery at large sizes. The existing code computes the Berlekamp trace by repeatedly squaring the current polynomial, adding the x term, and then reducing the whole result modulo the polynomial being split. That last step is polynomial division. For GF(2^32) it is done 31 times for each trace attempt and, typical for a division, it's quite slow.

This adds a fixed-modulus reducer and uses it in RecFindRoots.

There are two new reduction methods. Both involve precomputation per modulus which is reused when a splitting attempt fails, though the reuse hardly contributes to the performance increase because splitting failures are rare for high degree states.

For smaller degrees it precomputes a table for reducing squares. In characteristic two,

(sum a_i x^i)^2 = sum a_i^2 x^(2*i)

because all the cross terms cancel. So squaring a polynomial is not a polynomial multiplication at all: square each coefficient and move it to twice its old exponent.

For a fixed monic modulus f of degree d the only exponents which can need reduction after a square are even exponents from d through 2*d-2. The reducer precomputes

x^e mod f

for those even e. Then reducing a square just means adding the table row for x^(2i), scaled by a_i^2. If 2i is below d it is copied directly. There is no division in the trace loop and no polynomial multiplication: just field squarings, scalar multiplies, and xors. The table construction still computes remainders, but it is done once for the modulus rather than once per field bit for every trace.

Even for higher degrees the square table approach remains much faster than the original code but the table starts to consume a lot of memory because it's quadratic in d.

For larger degrees the code uses reciprocal reduction with a precomputed inverse. This is the usual trick of replacing repeated division by a fixed divisor with multiplication by an inverse. For division by a monic polynomial f, reverse the high part of the input and multiply it by the truncated power series inverse of reverse(f). This gives the quotient reversed. Reverse it back and subtract q*f to get the remainder.

q_rev = reverse(high(val)) * (1 / reverse(f)) mod x^k
q     = reverse(q_rev)
rem   = val ^ q*f

The inverse is unusually convenient here: reverse(f) has constant term 1, so it has a formal power series inverse. If f*g = 1+e, then in characteristic two

f * (f * g^2) = (f*g)^2 = 1 + e^2

so the Newton step can be written as

g' = f * g^2 mod x^(2*m)

and each step doubles the number of correct coefficients. This also benefits from cheap squaring in characteristic two.

This reciprocal method is only useful if the multiplications in it are not just another quadratic bottleneck. With schoolbook low products it mostly moves the cost around. The large-degree path therefore uses the subquadratic low-product multiplication in this patch; otherwise the reciprocal reducer would not be a win at any size.

The current cutoff uses the square table below TRACEMOD_TABLE_CUTOFF=512 and the reciprocal reducer above it. Even if memory usage were a non-issue the reciprocal method is faster for larger sizes. I errored towards making the cutoff lower in light of the fact that smaller cache systems may prefer lower memory usage. A memory sensitive user might want to reduce the threshold further, there is a pretty wide range where the two approaches have similar performance.

The subquadratic multiplies have a threshold for where karatsuba recursive spliting is replaced with a naive multiply: TRACEMOD_POLYMUL_CUTOFF=24. On my hardware clmul was fastest with 18..24 while generic preferred 40..48. The choice between these two configurations is pretty arbitary, but picking the optimal for either configuration only hurts the other by a couple percent. Since the reciprocal approach and its subquadratic multipliers are only used for higher degrees I figured non-clmul performance was already a lost cause there and favored the better config for clmul.

An obvious opportunity for further development is hoisting the allocations out of the recursive multipliers.

Because the subquadratic multipliers increases the size of the patch a lot I would have preferred to implement only the square table approach, but I really didn't want to add a big performance cliff from falling back to the old approach or quadratic memory usage for large inputs. Implementing only the reciprocal reducer didn't make sense either because it's slower than the original at very low sizes and even high degree inputs spend a fair amount of time running low degree TraceMods.

TraceMod is the slowest part of recovery at large sizes.  The existing code
computes the Berlekamp trace by repeatedly squaring the current polynomial,
adding the x term, and then reducing the whole result modulo the polynomial
being split.  That last step is polynomial division.  For GF(2^32) it is done
31 times for each trace attempt and, typical for a division, it's quite slow.

This adds a fixed-modulus reducer and uses it in RecFindRoots.

There are two new reduction methods.  Both involve precomputation per modulus
which is reused when a splitting attempt fails, though the reuse hardly
contributes to the performance increase because splitting failures are rare
for high degree states.

For smaller degrees it precomputes a table for reducing squares.  In
characteristic two,

    (sum a_i x^i)^2 = sum a_i^2 x^(2*i)

because all the cross terms cancel.  So squaring a polynomial is not a
polynomial multiplication at all: square each coefficient and move it to twice
its old exponent.

For a fixed monic modulus f of degree d the only exponents which can need
reduction after a square are even exponents from d through 2*d-2.  The reducer
precomputes

    x^e mod f

for those even e.  Then reducing a square just means adding the table row for
x^(2*i), scaled by a_i^2.  If 2*i is below d it is copied directly.  There is
no division in the trace loop and no polynomial multiplication: just field
squarings, scalar multiplies, and xors.  The table construction still computes
remainders, but it is done once for the modulus rather than once per field bit
for every trace.

Even for higher degrees the square table approach remains much faster than
the original code but the table starts to consume a lot of memory because
it's quadratic in d.

For larger degrees the code uses reciprocal reduction with a precomputed
inverse.  This is the usual trick of replacing repeated division by a fixed
divisor with multiplication by an inverse.  For division by a monic polynomial
f, reverse the high part of the input and multiply it by the truncated power
series inverse of reverse(f).  This gives the quotient reversed.  Reverse it
back and subtract q*f to get the remainder.

    q_rev = reverse(high(val)) * (1 / reverse(f)) mod x^k
    q     = reverse(q_rev)
    rem   = val ^ q*f

The inverse is unusually convenient here:  reverse(f) has constant term 1, so it
has a formal power series inverse.  If f*g = 1+e, then in characteristic two

    f * (f * g^2) = (f*g)^2 = 1 + e^2

so the Newton step can be written as

    g' = f * g^2 mod x^(2*m)

and each step doubles the number of correct coefficients.  This also benefits
from cheap squaring in characteristic two.

This reciprocal method is only useful if the multiplications in it are not just
another quadratic bottleneck.  With schoolbook low products it mostly moves the
cost around.  The large-degree path therefore uses the subquadratic low-product
multiplication in this patch; otherwise the reciprocal reducer would not be a
win at any size.

The current cutoff uses the square table below TRACEMOD_TABLE_CUTOFF=512 and the
reciprocal reducer above it.  Even if memory usage were a non-issue the reciprocal
method is faster for larger sizes.  I errored towards making the cutoff lower in
light of the fact that smaller cache systems may prefer lower memory usage.
A memory sensitive user might want to reduce the threshold further, there
is a pretty wide range where the two approaches have similar performance.

The subquadratic multiplies have a threshold for where karatsuba recursive
spliting is replaced with a naive multiply: TRACEMOD_POLYMUL_CUTOFF=24.
On my hardware clmul was fastest with 18..24 while generic preferred 40..48.
The choice between these two configurations is pretty arbitary, but picking
the optimal for either configuration only hurts the other by a couple percent.
Since the reciprocal approach and its subquadratic multipliers are only used
for higher degrees I figured non-clmul performance was already a lost cause
there and favored the better config for clmul.

An obvious opportunity for further development is hoisting the allocations
out of the recursive multipliers.

Because the subquadratic multipliers increases the size of the patch a lot I
would have preferred to implement only the square table approach, but I
really didn't want to add a big performance cliff from falling back to the
old approach or quadratic memory usage for large inputs.  Implementing only
the reciprocal reducer didn't make sense either because it's slower than
the original at very low sizes and even high degree inputs spend a fair
amount of time running low degree TraceMods.
@gmaxwell
Copy link
Copy Markdown
Contributor Author

gmaxwell commented May 21, 2026

Benchmarks were run with ./bench using errors=syndromes, comparing origin/master against this tree:

30generic

syndromes iters origin ms improved ms speedup
4 10000 0.00594 0.00480 1.238x
8 10000 0.02348 0.01736 1.353x
16 10000 0.07825 0.05296 1.478x
32 10000 0.25672 0.16270 1.578x
64 10000 0.82326 0.50736 1.623x
128 1000 2.75223 1.66350 1.654x
256 1000 9.32600 5.61539 1.661x
512 100 32.90037 22.33600 1.473x
1024 100 119.06383 76.11079 1.564x
2048 100 440.15910 262.30837 1.678x
4096 10 1669.81924 876.16488 1.906x
8192 10 6287.83410 2962.48400 2.122x
16384 10 23767.94820 9720.24374 2.445x

30clmul

syndromes iters origin ms improved ms speedup
4 10000 0.00153 0.00131 1.168x
8 10000 0.00580 0.00449 1.292x
16 10000 0.02051 0.01473 1.392x
32 10000 0.07403 0.04956 1.494x
64 10000 0.26923 0.17042 1.580x
128 1000 0.99520 0.60902 1.634x
256 1000 3.68811 2.24942 1.640x
512 100 13.88752 8.52037 1.630x
1024 100 52.38271 29.75034 1.761x
2048 100 197.72006 99.49323 1.987x
4096 10 749.44188 335.92846 2.231x
8192 10 2835.77094 1100.99565 2.576x
16384 10 10737.12406 3597.37954 2.985x

32generic

syndromes iters origin ms improved ms speedup
4 10000 0.00590 0.00464 1.272x
8 10000 0.02344 0.01700 1.379x
16 10000 0.07898 0.05237 1.508x
32 10000 0.25880 0.16269 1.591x
64 10000 0.85348 0.52498 1.626x
128 1000 2.89707 1.71873 1.686x
256 1000 9.93688 5.88484 1.689x
512 100 35.32780 23.28923 1.517x
1024 100 128.93057 79.72189 1.617x
2048 100 483.84048 274.66377 1.762x
4096 10 1835.96819 927.04517 1.980x
8192 10 6992.64223 3121.30186 2.240x
16384 10 26500.79943 10481.75941 2.528x

32clmul

syndromes iters origin ms improved ms speedup
4 10000 0.00209 0.00173 1.208x
8 10000 0.00866 0.00649 1.334x
16 10000 0.03296 0.02268 1.453x
32 10000 0.12542 0.07989 1.570x
64 10000 0.48076 0.28929 1.662x
128 1000 1.79980 1.07562 1.673x
256 1000 6.79200 4.02129 1.689x
512 100 25.91155 14.50579 1.786x
1024 100 98.52817 49.26626 2.000x
2048 100 375.43699 166.49850 2.255x
4096 10 1431.21744 556.20519 2.573x
8192 10 5457.67815 1813.35139 3.010x
16384 10 20760.05743 5900.72592 3.518x

64generic

syndromes iters origin ms improved ms speedup
4 10000 0.01892 0.01421 1.331x
8 10000 0.08335 0.05650 1.475x
16 10000 0.29244 0.18210 1.606x
32 10000 1.02687 0.59330 1.731x
64 10000 3.48129 1.97271 1.765x
128 1000 11.98930 6.71013 1.787x
256 1000 42.64799 23.48398 1.816x
512 100 156.77761 94.76361 1.654x
1024 100 591.53841 329.36282 1.796x
2048 100 2277.88486 1136.53660 2.004x
4096 10 8825.14955 3839.37698 2.299x
8192 10 34651.40565 12891.50197 2.688x
16384 10 136070.29395 42431.32552 3.207x

64clmul

syndromes iters origin ms improved ms speedup
4 10000 0.00408 0.00331 1.233x
8 10000 0.01807 0.01318 1.371x
16 10000 0.06997 0.04724 1.481x
32 10000 0.26974 0.16630 1.622x
64 10000 1.03576 0.60101 1.723x
128 1000 3.95967 2.23919 1.768x
256 1000 15.24281 8.44897 1.804x
512 100 59.19956 31.02885 1.908x
1024 100 230.09819 107.57759 2.139x
2048 100 897.23526 355.77757 2.522x
4096 10 3513.61774 1154.75472 3.043x
8192 10 13755.81133 3829.66813 3.592x
16384 10 53914.92149 12141.21323 4.441x

I'm interested in what the performance looks like on arm with the clmul PR.

@gmaxwell
Copy link
Copy Markdown
Contributor Author

The test failures on g++ macos look like something wrong with the build host.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant