Fast reduction mod poly for TraceMod. by gmaxwell · Pull Request #101 · bitcoin-core/minisketch

gmaxwell · 2026-05-21T01:30:26Z

While working with minisketch for an amateur radio application, I came up with a performance improvement:

TraceMod is the slowest part of recovery at large sizes. The existing code computes the Berlekamp trace by repeatedly squaring the current polynomial, adding the x term, and then reducing the whole result modulo the polynomial being split. That last step is polynomial division. For GF(2^32) it is done 31 times for each trace attempt and, typical for a division, it's quite slow.

This adds a fixed-modulus reducer and uses it in RecFindRoots.

There are two new reduction methods. Both involve precomputation per modulus which is reused when a splitting attempt fails, though the reuse hardly contributes to the performance increase because splitting failures are rare for high degree states.

For smaller degrees it precomputes a table for reducing squares. In characteristic two,

(sum a_i x^i)^2 = sum a_i^2 x^(2*i)

because all the cross terms cancel. So squaring a polynomial is not a polynomial multiplication at all: square each coefficient and move it to twice its old exponent.

For a fixed monic modulus f of degree d the only exponents which can need reduction after a square are even exponents from d through 2*d-2. The reducer precomputes

x^e mod f

for those even e. Then reducing a square just means adding the table row for x^(2i), scaled by a_i^2. If 2i is below d it is copied directly. There is no division in the trace loop and no polynomial multiplication: just field squarings, scalar multiplies, and xors. The table construction still computes remainders, but it is done once for the modulus rather than once per field bit for every trace.

Even for higher degrees the square table approach remains much faster than the original code but the table starts to consume a lot of memory because it's quadratic in d.

For larger degrees the code uses reciprocal reduction with a precomputed inverse. This is the usual trick of replacing repeated division by a fixed divisor with multiplication by an inverse. For division by a monic polynomial f, reverse the high part of the input and multiply it by the truncated power series inverse of reverse(f). This gives the quotient reversed. Reverse it back and subtract q*f to get the remainder.

q_rev = reverse(high(val)) * (1 / reverse(f)) mod x^k
q     = reverse(q_rev)
rem   = val ^ q*f

The inverse is unusually convenient here: reverse(f) has constant term 1, so it has a formal power series inverse. If f*g = 1+e, then in characteristic two

f * (f * g^2) = (f*g)^2 = 1 + e^2

so the Newton step can be written as

g' = f * g^2 mod x^(2*m)

and each step doubles the number of correct coefficients. This also benefits from cheap squaring in characteristic two.

This reciprocal method is only useful if the multiplications in it are not just another quadratic bottleneck. With schoolbook low products it mostly moves the cost around. The large-degree path therefore uses the subquadratic low-product multiplication in this patch; otherwise the reciprocal reducer would not be a win at any size.

The current cutoff uses the square table below TRACEMOD_TABLE_CUTOFF=512 and the reciprocal reducer above it. Even if memory usage were a non-issue the reciprocal method is faster for larger sizes. I errored towards making the cutoff lower in light of the fact that smaller cache systems may prefer lower memory usage. A memory sensitive user might want to reduce the threshold further, there is a pretty wide range where the two approaches have similar performance.

The subquadratic multiplies have a threshold for where karatsuba recursive spliting is replaced with a naive multiply: TRACEMOD_POLYMUL_CUTOFF=24. On my hardware clmul was fastest with 18..24 while generic preferred 40..48. The choice between these two configurations is pretty arbitary, but picking the optimal for either configuration only hurts the other by a couple percent. Since the reciprocal approach and its subquadratic multipliers are only used for higher degrees I figured non-clmul performance was already a lost cause there and favored the better config for clmul.

An obvious opportunity for further development is hoisting the allocations out of the recursive multipliers.

Because the subquadratic multipliers increases the size of the patch a lot I would have preferred to implement only the square table approach, but I really didn't want to add a big performance cliff from falling back to the old approach or quadratic memory usage for large inputs. Implementing only the reciprocal reducer didn't make sense either because it's slower than the original at very low sizes and even high degree inputs spend a fair amount of time running low degree TraceMods.

TraceMod is the slowest part of recovery at large sizes. The existing code computes the Berlekamp trace by repeatedly squaring the current polynomial, adding the x term, and then reducing the whole result modulo the polynomial being split. That last step is polynomial division. For GF(2^32) it is done 31 times for each trace attempt and, typical for a division, it's quite slow. This adds a fixed-modulus reducer and uses it in RecFindRoots. There are two new reduction methods. Both involve precomputation per modulus which is reused when a splitting attempt fails, though the reuse hardly contributes to the performance increase because splitting failures are rare for high degree states. For smaller degrees it precomputes a table for reducing squares. In characteristic two, (sum a_i x^i)^2 = sum a_i^2 x^(2*i) because all the cross terms cancel. So squaring a polynomial is not a polynomial multiplication at all: square each coefficient and move it to twice its old exponent. For a fixed monic modulus f of degree d the only exponents which can need reduction after a square are even exponents from d through 2*d-2. The reducer precomputes x^e mod f for those even e. Then reducing a square just means adding the table row for x^(2*i), scaled by a_i^2. If 2*i is below d it is copied directly. There is no division in the trace loop and no polynomial multiplication: just field squarings, scalar multiplies, and xors. The table construction still computes remainders, but it is done once for the modulus rather than once per field bit for every trace. Even for higher degrees the square table approach remains much faster than the original code but the table starts to consume a lot of memory because it's quadratic in d. For larger degrees the code uses reciprocal reduction with a precomputed inverse. This is the usual trick of replacing repeated division by a fixed divisor with multiplication by an inverse. For division by a monic polynomial f, reverse the high part of the input and multiply it by the truncated power series inverse of reverse(f). This gives the quotient reversed. Reverse it back and subtract q*f to get the remainder. q_rev = reverse(high(val)) * (1 / reverse(f)) mod x^k q = reverse(q_rev) rem = val ^ q*f The inverse is unusually convenient here: reverse(f) has constant term 1, so it has a formal power series inverse. If f*g = 1+e, then in characteristic two f * (f * g^2) = (f*g)^2 = 1 + e^2 so the Newton step can be written as g' = f * g^2 mod x^(2*m) and each step doubles the number of correct coefficients. This also benefits from cheap squaring in characteristic two. This reciprocal method is only useful if the multiplications in it are not just another quadratic bottleneck. With schoolbook low products it mostly moves the cost around. The large-degree path therefore uses the subquadratic low-product multiplication in this patch; otherwise the reciprocal reducer would not be a win at any size. The current cutoff uses the square table below TRACEMOD_TABLE_CUTOFF=512 and the reciprocal reducer above it. Even if memory usage were a non-issue the reciprocal method is faster for larger sizes. I errored towards making the cutoff lower in light of the fact that smaller cache systems may prefer lower memory usage. A memory sensitive user might want to reduce the threshold further, there is a pretty wide range where the two approaches have similar performance. The subquadratic multiplies have a threshold for where karatsuba recursive spliting is replaced with a naive multiply: TRACEMOD_POLYMUL_CUTOFF=24. On my hardware clmul was fastest with 18..24 while generic preferred 40..48. The choice between these two configurations is pretty arbitary, but picking the optimal for either configuration only hurts the other by a couple percent. Since the reciprocal approach and its subquadratic multipliers are only used for higher degrees I figured non-clmul performance was already a lost cause there and favored the better config for clmul. An obvious opportunity for further development is hoisting the allocations out of the recursive multipliers. Because the subquadratic multipliers increases the size of the patch a lot I would have preferred to implement only the square table approach, but I really didn't want to add a big performance cliff from falling back to the old approach or quadratic memory usage for large inputs. Implementing only the reciprocal reducer didn't make sense either because it's slower than the original at very low sizes and even high degree inputs spend a fair amount of time running low degree TraceMods.

gmaxwell · 2026-05-21T01:33:24Z

Benchmarks were run with ./bench using errors=syndromes, comparing origin/master against this tree:

30generic

syndromes	iters	origin ms	improved ms	speedup
4	10000	0.00594	0.00480	1.238x
8	10000	0.02348	0.01736	1.353x
16	10000	0.07825	0.05296	1.478x
32	10000	0.25672	0.16270	1.578x
64	10000	0.82326	0.50736	1.623x
128	1000	2.75223	1.66350	1.654x
256	1000	9.32600	5.61539	1.661x
512	100	32.90037	22.33600	1.473x
1024	100	119.06383	76.11079	1.564x
2048	100	440.15910	262.30837	1.678x
4096	10	1669.81924	876.16488	1.906x
8192	10	6287.83410	2962.48400	2.122x
16384	10	23767.94820	9720.24374	2.445x

30clmul

syndromes	iters	origin ms	improved ms	speedup
4	10000	0.00153	0.00131	1.168x
8	10000	0.00580	0.00449	1.292x
16	10000	0.02051	0.01473	1.392x
32	10000	0.07403	0.04956	1.494x
64	10000	0.26923	0.17042	1.580x
128	1000	0.99520	0.60902	1.634x
256	1000	3.68811	2.24942	1.640x
512	100	13.88752	8.52037	1.630x
1024	100	52.38271	29.75034	1.761x
2048	100	197.72006	99.49323	1.987x
4096	10	749.44188	335.92846	2.231x
8192	10	2835.77094	1100.99565	2.576x
16384	10	10737.12406	3597.37954	2.985x

32generic

syndromes	iters	origin ms	improved ms	speedup
4	10000	0.00590	0.00464	1.272x
8	10000	0.02344	0.01700	1.379x
16	10000	0.07898	0.05237	1.508x
32	10000	0.25880	0.16269	1.591x
64	10000	0.85348	0.52498	1.626x
128	1000	2.89707	1.71873	1.686x
256	1000	9.93688	5.88484	1.689x
512	100	35.32780	23.28923	1.517x
1024	100	128.93057	79.72189	1.617x
2048	100	483.84048	274.66377	1.762x
4096	10	1835.96819	927.04517	1.980x
8192	10	6992.64223	3121.30186	2.240x
16384	10	26500.79943	10481.75941	2.528x

32clmul

syndromes	iters	origin ms	improved ms	speedup
4	10000	0.00209	0.00173	1.208x
8	10000	0.00866	0.00649	1.334x
16	10000	0.03296	0.02268	1.453x
32	10000	0.12542	0.07989	1.570x
64	10000	0.48076	0.28929	1.662x
128	1000	1.79980	1.07562	1.673x
256	1000	6.79200	4.02129	1.689x
512	100	25.91155	14.50579	1.786x
1024	100	98.52817	49.26626	2.000x
2048	100	375.43699	166.49850	2.255x
4096	10	1431.21744	556.20519	2.573x
8192	10	5457.67815	1813.35139	3.010x
16384	10	20760.05743	5900.72592	3.518x

64generic

syndromes	iters	origin ms	improved ms	speedup
4	10000	0.01892	0.01421	1.331x
8	10000	0.08335	0.05650	1.475x
16	10000	0.29244	0.18210	1.606x
32	10000	1.02687	0.59330	1.731x
64	10000	3.48129	1.97271	1.765x
128	1000	11.98930	6.71013	1.787x
256	1000	42.64799	23.48398	1.816x
512	100	156.77761	94.76361	1.654x
1024	100	591.53841	329.36282	1.796x
2048	100	2277.88486	1136.53660	2.004x
4096	10	8825.14955	3839.37698	2.299x
8192	10	34651.40565	12891.50197	2.688x
16384	10	136070.29395	42431.32552	3.207x

64clmul

syndromes	iters	origin ms	improved ms	speedup
4	10000	0.00408	0.00331	1.233x
8	10000	0.01807	0.01318	1.371x
16	10000	0.06997	0.04724	1.481x
32	10000	0.26974	0.16630	1.622x
64	10000	1.03576	0.60101	1.723x
128	1000	3.95967	2.23919	1.768x
256	1000	15.24281	8.44897	1.804x
512	100	59.19956	31.02885	1.908x
1024	100	230.09819	107.57759	2.139x
2048	100	897.23526	355.77757	2.522x
4096	10	3513.61774	1154.75472	3.043x
8192	10	13755.81133	3829.66813	3.592x
16384	10	53914.92149	12141.21323	4.441x

I'm interested in what the performance looks like on arm with the clmul PR.

gmaxwell · 2026-05-21T04:08:47Z

The test failures on g++ macos look like something wrong with the build host.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fast reduction mod poly for TraceMod.#101

Fast reduction mod poly for TraceMod.#101
gmaxwell wants to merge 1 commit into
bitcoin-core:masterfrom
gmaxwell:fast_tracemod_reducers

gmaxwell commented May 21, 2026 •

edited

Loading

Uh oh!

gmaxwell commented May 21, 2026 •

edited

Loading

Uh oh!

gmaxwell commented May 21, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

gmaxwell commented May 21, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

gmaxwell commented May 21, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

30generic

30clmul

32generic

32clmul

64generic

64clmul

Uh oh!

gmaxwell commented May 21, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

gmaxwell commented May 21, 2026 •

edited

Loading

gmaxwell commented May 21, 2026 •

edited

Loading