Fast reduction mod poly for TraceMod. #101
Open
gmaxwell wants to merge 1 commit into
TraceMod is the slowest part of recovery at large sizes. The existing code
computes the Berlekamp trace by repeatedly squaring the current polynomial,
adding the x term, and then reducing the whole result modulo the polynomial
being split. That last step is polynomial division. For GF(2^32) it is done
31 times for each trace attempt and, typical for a division, it's quite slow.
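For reference, the loop described above can be sketched as follows. This is an illustration in Python, not the C++ in the tree: the small GF(2^8) field, the coefficient-list layout (lowest degree first), and every function name here are assumptions made for the example.

```python
BITS, MOD = 8, 0x11B  # illustrative field GF(2^8); an assumption, not the patch's field

def gf_mul(a, b):
    """Multiply two elements of GF(2^BITS) by shift-and-xor, reducing by MOD."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if a >> BITS:
            a ^= MOD
    return r

def poly_mod(val, f):
    """Remainder of val modulo monic f, by schoolbook long division."""
    val = list(val)
    d = len(f) - 1
    for i in range(len(val) - 1, d - 1, -1):
        c = val[i]
        if c:
            for j in range(d + 1):
                val[i - d + j] ^= gf_mul(c, f[j])
    return (val + [0] * d)[:d]

def poly_sqr(a):
    """Square in characteristic two: a_i^2 lands on exponent 2*i."""
    out = [0] * (2 * len(a) - 1)
    for i, c in enumerate(a):
        out[2 * i] = gf_mul(c, c)
    return out

def trace_mod_naive(param, f):
    """Tr(param*x) mod f: BITS-1 rounds of square, add param*x, divide."""
    out = [0, param]
    for _ in range(BITS - 1):
        out = poly_sqr(out) + [0]   # guarantee an x^1 slot exists
        out[1] ^= param             # add the x term
        out = poly_mod(out, f)      # the expensive step: one division per round
    return out
```

The call to `poly_mod` inside the loop is the per-round division that the rest of this description is about removing.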
This adds a fixed-modulus reducer and uses it in RecFindRoots.
There are two new reduction methods. Both involve precomputation per modulus
which is reused when a splitting attempt fails, though the reuse hardly
contributes to the performance increase because splitting failures are rare
for high degree states.
For smaller degrees it precomputes a table for reducing squares. In
characteristic two,
(sum a_i x^i)^2 = sum a_i^2 x^(2*i)
because all the cross terms cancel. So squaring a polynomial is not a
polynomial multiplication at all: square each coefficient and move it to twice
its old exponent.
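A small check of that identity, as a sketch under the same kind of assumptions (Python, an illustrative GF(2^8) field, coefficient lists with the lowest degree first, made-up function names): the coefficient-wise square agrees with a full schoolbook product of the polynomial with itself.

```python
BITS, MOD = 8, 0x11B  # illustrative field GF(2^8); an assumption for the example

def gf_mul(a, b):
    """Multiply two elements of GF(2^BITS) by shift-and-xor, reducing by MOD."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if a >> BITS:
            a ^= MOD
    return r

def poly_mul(a, b):
    """Schoolbook polynomial product; addition of coefficients is xor."""
    out = [0] * (len(a) + len(b) - 1)
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            out[i + j] ^= gf_mul(x, y)
    return out

def poly_sqr(a):
    """Square each coefficient and move it to twice its old exponent."""
    out = [0] * (2 * len(a) - 1)
    for i, c in enumerate(a):
        out[2 * i] = gf_mul(c, c)
    return out

a = [3, 0, 200, 17, 1]
same = poly_sqr(a) == poly_mul(a, a)  # cross terms a_i*a_j appear twice and cancel
```

`poly_sqr` does one field squaring per coefficient, while `poly_mul` does a quadratic number of field multiplies to reach the same result.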
For a fixed monic modulus f of degree d the only exponents which can need
reduction after a square are even exponents from d through 2*d-2. The reducer
precomputes
x^e mod f
for those even e. Then reducing a square just means adding the table row for
x^(2*i), scaled by a_i^2. If 2*i is below d it is copied directly. There is
no division in the trace loop and no polynomial multiplication: just field
squarings, scalar multiplies, and xors. The table construction still computes
remainders, but it is done once for the modulus rather than once per field bit
for every trace.
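The table reducer described above might look roughly like this. It is a sketch, not the patch's code: the GF(2^8) field, the dict keyed by exponent, the incremental way the rows are built (multiply the previous row by x and fold x^d back in), and all names are assumptions chosen for the example.

```python
BITS, MOD = 8, 0x11B  # illustrative field GF(2^8); an assumption for the example

def gf_mul(a, b):
    """Multiply two elements of GF(2^BITS) by shift-and-xor, reducing by MOD."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if a >> BITS:
            a ^= MOD
    return r

def build_square_table(f):
    """Rows x^e mod f for even e in [d, 2d-2], for monic f of degree d."""
    d = len(f) - 1
    row = list(f[:d])                 # x^d mod f, since f is monic and -1 == 1
    table = {}
    for e in range(d, 2 * d - 1):
        if e % 2 == 0:
            table[e] = row
        top = row[-1]
        row = [0] + row[:-1]          # multiply by x, dropping the x^d term ...
        row = [c ^ gf_mul(top, fj) for c, fj in zip(row, f)]  # ... and fold it back
    return table

def sqr_mod(a, f, table):
    """a^2 mod f for deg(a) < d: field squarings, scalar multiplies, xors only."""
    d = len(f) - 1
    out = [0] * d
    for i, c in enumerate(a):
        if c == 0:
            continue
        s = gf_mul(c, c)
        e = 2 * i
        if e < d:
            out[e] ^= s               # below d: copied directly
        else:
            for j, t in enumerate(table[e]):
                out[j] ^= gf_mul(s, t)
    return out

def trace_mod_table(param, f, table):
    """Trace loop with no division; assumes d >= 2 so the x term needs no reduction."""
    d = len(f) - 1
    out = [0] * d
    out[1] = param
    for _ in range(BITS - 1):
        out = sqr_mod(out, f, table)
        out[1] ^= param
    return out
```

The table is built once per modulus and then reused by every round of every trace attempt against that modulus.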
Even for higher degrees the square table approach remains much faster than
the original code but the table starts to consume a lot of memory because
it's quadratic in d.
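To put a rough number on the quadratic growth, here is a back-of-the-envelope sketch. It assumes one full row of d coefficients per even exponent in [d, 2*d-2] and 8 bytes per field element; both the layout and the element size are assumptions for the example and may not match what the patch actually stores.

```python
def square_table_elements(d):
    """Field elements in a table with one d-coefficient row per even e in [d, 2d-2]."""
    rows = len(range(d + (d % 2), 2 * d - 1, 2))
    return rows * d

def square_table_bytes(d, element_bytes=8):
    """Approximate table size; element_bytes=8 is an assumed storage width."""
    return square_table_elements(d) * element_bytes

at_cutoff = square_table_bytes(512)    # roughly d*d/2 elements at d=512
at_8x = square_table_bytes(4096)       # 8x the degree costs 64x the memory
```

Under these assumptions the table is about d*d/2 elements, so each doubling of the degree quadruples the memory.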
For larger degrees the code uses reciprocal reduction with a precomputed
inverse. This is the usual trick of replacing repeated division by a fixed
divisor with multiplication by an inverse. For division by a monic polynomial
f, reverse the high part of the input and multiply it by the truncated power
series inverse of reverse(f). This gives the quotient reversed. Reverse it
back and subtract q*f to get the remainder.
q_rev = reverse(high(val)) * (1 / reverse(f)) mod x^k
q = reverse(q_rev)
rem = val ^ q*f
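The three formulas above can be exercised with a short sketch. Everything here is an assumption for illustration: the GF(2^8) field, the function names, and the schoolbook low products. For simplicity the inverse is computed with a plain term-by-term recurrence rather than a Newton iteration, so the block shows only the reduction itself and is not fast.

```python
BITS, MOD = 8, 0x11B  # illustrative field GF(2^8); an assumption for the example

def gf_mul(a, b):
    """Multiply two elements of GF(2^BITS) by shift-and-xor, reducing by MOD."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if a >> BITS:
            a ^= MOD
    return r

def poly_mul_low(a, b, k):
    """Schoolbook product of a and b truncated to the k lowest coefficients."""
    out = [0] * k
    for i, x in enumerate(a[:k]):
        if x:
            for j, y in enumerate(b[:k - i]):
                out[i + j] ^= gf_mul(x, y)
    return out

def series_inverse(h, k):
    """1/h mod x^k for h[0] == 1, term by term (simple, quadratic)."""
    g = [1] + [0] * (k - 1)
    for n in range(1, k):
        acc = 0
        for i in range(1, min(n, len(h) - 1) + 1):
            acc ^= gf_mul(h[i], g[n - i])
        g[n] = acc
    return g

def reduce_reciprocal(val, f, inv):
    """val mod monic f, where inv = 1/reverse(f) to at least len(val)-d terms."""
    d = len(f) - 1
    if len(val) <= d:
        return (list(val) + [0] * d)[:d]
    high = val[d:]                     # only these coefficients decide the quotient
    k = len(high)
    assert k <= len(inv)
    q_rev = poly_mul_low(high[::-1], inv, k)
    q = q_rev[::-1]
    qf_low = poly_mul_low(q, f, d)     # the remainder only needs q*f below x^d
    return [a ^ b for a, b in zip(val[:d], qf_low)]

f = [7, 0, 19, 200, 3, 1]              # monic, degree 5
inv = series_inverse(f[::-1], 4)       # enough for inputs of length up to 2*5-1
rem = reduce_reciprocal([9, 0, 1, 0, 0, 0, 77, 0, 250], f, inv)
```

Note that both multiplications are low products: the quotient needs only k coefficients and the remainder only d.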
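The inverse is unusually convenient here: reverse(f) has constant term 1, so it
has a formal power series inverse. In the formulas that follow, f stands for the
series being inverted (reverse(f) above) and g for its approximate inverse. If
f*g = 1+e with e divisible by x^m, then in characteristic two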
f * (f * g^2) = (f*g)^2 = 1 + e^2
so the Newton step can be written as
g' = f * g^2 mod x^(2*m)
and each step doubles the number of correct coefficients. This also benefits
from cheap squaring in characteristic two.
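A sketch of that Newton iteration, with the series being inverted called h to keep it apart from the modulus. The GF(2^8) field, the function names, and the schoolbook low product are assumptions for the example; the patch's large-degree path uses a subquadratic multiply instead.

```python
BITS, MOD = 8, 0x11B  # illustrative field GF(2^8); an assumption for the example

def gf_mul(a, b):
    """Multiply two elements of GF(2^BITS) by shift-and-xor, reducing by MOD."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if a >> BITS:
            a ^= MOD
    return r

def poly_mul_low(a, b, k):
    """Schoolbook product of a and b truncated to the k lowest coefficients."""
    out = [0] * k
    for i, x in enumerate(a[:k]):
        if x:
            for j, y in enumerate(b[:k - i]):
                out[i + j] ^= gf_mul(x, y)
    return out

def poly_sqr_low(g, k):
    """g^2 mod x^k: coefficient-wise squares at doubled exponents."""
    out = [0] * k
    for i, c in enumerate(g):
        if 2 * i < k:
            out[2 * i] = gf_mul(c, c)
    return out

def series_inverse_newton(h, k):
    """1/h mod x^k for h[0] == 1, doubling the precision each step."""
    g = [1]                            # correct mod x^1 because h[0] == 1
    m = 1
    while m < k:
        m = min(2 * m, k)
        g = poly_mul_low(h, poly_sqr_low(g, m), m)   # g' = h * g^2 mod x^m
    return g

h = [1, 3, 7, 0, 9, 200]               # constant term 1, like reverse(f) for monic f
g = series_inverse_newton(h, 13)
check = poly_mul_low(h, g, 13)         # 1 followed by zeros if g is correct
```

Each step costs one cheap squaring and one low product, and the last step is clamped to k so no coefficients beyond the requested precision are computed.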
This reciprocal method is only useful if the multiplications in it are not just
another quadratic bottleneck. With schoolbook low products it mostly moves the
cost around. The large-degree path therefore uses the subquadratic low-product
multiplication in this patch; otherwise the reciprocal reducer would not be a
win at any size.
The code currently uses the square table below TRACEMOD_TABLE_CUTOFF=512 and the
reciprocal reducer above it. Even if memory usage were a non-issue the reciprocal
method is faster for larger sizes. I erred toward making the cutoff lower
because systems with smaller caches may prefer lower memory usage.
A memory-sensitive user might want to reduce the threshold further; there
is a pretty wide range where the two approaches have similar performance.
The subquadratic multiplies have a threshold below which Karatsuba recursive
splitting is replaced with a naive multiply: TRACEMOD_POLYMUL_CUTOFF=24.
On my hardware clmul was fastest with 18..24 while generic preferred 40..48.
The choice between these two configurations is fairly arbitrary, since picking
the optimum for either configuration only hurts the other by a couple of percent.
Since the reciprocal approach and its subquadratic multipliers are only used
for higher degrees I figured non-clmul performance was already a lost cause
there and favored the better config for clmul.
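For readers unfamiliar with the recursion being tuned, here is a sketch of Karatsuba splitting in characteristic two with a schoolbook base case. It computes a full product of equal-length inputs, whereas the patch uses low products; the GF(2^8) field and the function names are assumptions, and only the cutoff value 24 is taken from the description.

```python
BITS, MOD = 8, 0x11B  # illustrative field GF(2^8); an assumption for the example
CUTOFF = 24           # same value as TRACEMOD_POLYMUL_CUTOFF in the description

def gf_mul(a, b):
    """Multiply two elements of GF(2^BITS) by shift-and-xor, reducing by MOD."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if a >> BITS:
            a ^= MOD
    return r

def mul_schoolbook(a, b):
    """Naive quadratic product, used at and below the cutoff."""
    out = [0] * (len(a) + len(b) - 1)
    for i, x in enumerate(a):
        if x:
            for j, y in enumerate(b):
                out[i + j] ^= gf_mul(x, y)
    return out

def mul_karatsuba(a, b):
    """Full product of two equal-length coefficient lists."""
    n = len(a)
    if n <= CUTOFF:
        return mul_schoolbook(a, b)
    h = n // 2
    a0, a1, b0, b1 = a[:h], a[h:], b[:h], b[h:]
    z0 = mul_karatsuba(a0, b0)
    z2 = mul_karatsuba(a1, b1)
    pad = [0] * (n - 2 * h)            # the high halves may be one longer
    sa = [x ^ y for x, y in zip(a1, a0 + pad)]
    sb = [x ^ y for x, y in zip(b1, b0 + pad)]
    z1 = mul_karatsuba(sa, sb)         # (a0+a1)*(b0+b1)
    out = [0] * (2 * n - 1)
    for i, c in enumerate(z0):
        out[i] ^= c
        z1[i] ^= c                     # subtraction is xor in characteristic two
    for i, c in enumerate(z2):
        out[i + 2 * h] ^= c
        z1[i] ^= c
    for i, c in enumerate(z1):
        out[i + h] ^= c
    return out

a = [(i * 37 + 11) % 256 for i in range(100)]
b = [(i * 101 + 7) % 256 for i in range(100)]
agree = mul_karatsuba(a, b) == mul_schoolbook(a, b)
```

Raising `CUTOFF` trades recursion overhead for more quadratic work in the base case, which is why the best value depends on how fast the field multiply is.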
An obvious opportunity for further development is hoisting the allocations
out of the recursive multipliers.
Because the subquadratic multipliers increase the size of the patch a lot I
would have preferred to implement only the square table approach, but I
really didn't want to add a big performance cliff from falling back to the
old approach or quadratic memory usage for large inputs. Implementing only
the reciprocal reducer didn't make sense either because it's slower than
the original at very low sizes and even high degree inputs spend a fair
amount of time running low degree TraceMods.
Benchmarks were run with ./bench using errors=syndromes, comparing origin/master against this tree.
[Benchmark figures: 30generic, 30clmul, 32generic, 32clmul, 64generic, 64clmul]
I'm interested in what the performance looks like on arm with the clmul PR.
The test failures on g++ macos look like something wrong with the build host.
While working with minisketch for an amateur radio application, I came up with a performance improvement: