Merged
Changes from 27 commits
112 commits
7ecf234
Adding new, but unsued, `_dec_str_to_int_inner()`, + discussion.
tim-one May 8, 2024
6abc9cc
Correction: `decimal` in this conext computes reciprocals of powers
tim-one May 8, 2024
f4c15ce
Clarify precedence in assert.
tim-one May 8, 2024
8673dcb
Update Lib/_pylong.py
tim-one May 8, 2024
b1aa316
Update Lib/_pylong.py
tim-one May 8, 2024
d9490d0
Update Lib/_pylong.py
tim-one May 8, 2024
734ca08
Merge branch 'main' into str2int
tim-one May 8, 2024
0512f07
Merge branch 'main' into str2int
tim-one May 8, 2024
b2734c2
My memory was wrong: using explicit, cached reciprocal approximations…
tim-one May 8, 2024
c0570e7
Merge branch 'str2int' of https://github.com/tim-one/cpython into str…
tim-one May 8, 2024
9906417
Repair mysterious damage to comment.
tim-one May 8, 2024
27b00ea
Cut the precision of the reciprocal too before multiplying.
tim-one May 8, 2024
46c400b
Merge branch 'main' into str2int
tim-one May 8, 2024
1e1b26d
New comments.
tim-one May 9, 2024
2f3b6b8
Typo repair.
tim-one May 9, 2024
fc2f646
Raise BYTELIM.
tim-one May 9, 2024
0a1daab
For long strings, switch to the new implementation if `_decimal` is p…
tim-one May 9, 2024
1977774
📜🤖 Added by blurb_it.
blurb-it[bot] May 9, 2024
5ce49ae
Update 2024-05-09-02-37-25.gh-issue-118750.7aLfT-.rst
tim-one May 9, 2024
998b871
Do a better job of picking working precision for the multiply.
tim-one May 9, 2024
35605ea
Merge remote-tracking branch 'upstream/main' into str2int
tim-one May 9, 2024
53b4fa8
Fotce the initial `w` to be an upper bound on the true value.
tim-one May 9, 2024
34774d6
Merge remote-tracking branch 'upstream/main' into str2int
tim-one May 9, 2024
1a78510
Serhiy wore me out ;-) Add best possible log(10, 255) constant.
tim-one May 9, 2024
b766ebf
Repair comment.
tim-one May 9, 2024
3b585fb
Merge remote-tracking branch 'upstream/main' into str2int
tim-one May 10, 2024
9cccb8d
Repair test for "close to maybe wrong" initial computation of `w`.
tim-one May 10, 2024
96df6f5
Merge remote-tracking branch 'upstream/main' into str2int
tim-one May 10, 2024
a241843
Added exact quotient correction for cases adding 1 once isn't enough.
tim-one May 10, 2024
d67dfd1
Merge remote-tracking branch 'upstream/main' into str2int
tim-one May 11, 2024
45708db
Check in testing helper I kept ripping out before earlier commits.
tim-one May 11, 2024
83f07da
And one more test, of thta absurdly large inputs are rejected.
tim-one May 11, 2024
ae41a10
Trying to repair new test that failed on some test platforms.
tim-one May 11, 2024
3779a92
I don't know what the "WASI" testbot is, except that it's annoying ;-)
tim-one May 11, 2024
5140c23
And more random thrashing.
tim-one May 11, 2024
45c1da0
Another random stab at making WASI happy.
tim-one May 11, 2024
ee17e54
More random thrashing to try to shut up WASI :-(
tim-one May 11, 2024
98cfca4
Merge remote-tracking branch 'upstream/main' into str2int
tim-one May 11, 2024
ed18fdc
Make GUARD a keyword-only argument. Plus many comment changes.
tim-one May 11, 2024
a5def42
repair spelling mistake in comment
tim-one May 11, 2024
727d396
Update Lib/_pylong.py
tim-one May 11, 2024
505012c
Update Lib/_pylong.py
tim-one May 11, 2024
25b47c7
Merge remote-tracking branch 'upstream/main' into str2int
tim-one May 11, 2024
83ea161
I now believe GUARD=3(!) is sufficient. And that's the smallest
tim-one May 11, 2024
07c70bd
Apply suggestions from code review
tim-one May 11, 2024
fcfa5aa
The lint checker whinef about a blank line at the end of the file.
tim-one May 11, 2024
7763ab2
And restoring another code review change.
tim-one May 11, 2024
b1443ef
Update Lib/test/test_int.py
tim-one May 11, 2024
3ee6247
Merge remote-tracking branch 'upstream/main' into str2int
tim-one May 12, 2024
44c85ac
Made suggested word change.
tim-one May 12, 2024
1a90eb6
Just commehts: be more careful about explaining how bad the
tim-one May 12, 2024
cd07da3
Typo repair.
tim-one May 12, 2024
a985009
Someone should contribute a spell-checker to IDLE ;-)
tim-one May 12, 2024
7d4c3e8
Merge remote-tracking branch 'upstream/main' into str2int
tim-one May 12, 2024
ec326b3
Free the memory for `hi` as soon as we're done with it.
tim-one May 12, 2024
922dca1
Add notes about the details of how reciprocals are computed.
tim-one May 12, 2024
9fe92f3
Repair grammar.
tim-one May 12, 2024
4e360e8
Merge remote-tracking branch 'upstream/main' into str2int
tim-one May 12, 2024
4eaccfc
Apply suggestions from code review
tim-one May 12, 2024
19c0690
Merge branch 'str2int' of https://github.com/tim-one/cpython into str…
tim-one May 12, 2024
5178e09
Big change: keeping the exact reciprocals now.
tim-one May 13, 2024
5a0a574
Remove no-longer-useful reset of ctx.prec.
tim-one May 13, 2024
1ee272e
Reciprocal approximations are back - you knew it was coming ;-)
tim-one May 13, 2024
3129d5f
Comment repair.
tim-one May 13, 2024
d1a1fce
Merge remote-tracking branch 'upstream/main' into str2int
tim-one May 13, 2024
bfd6247
Finished the proof - no more hand-waving on any point :-)
tim-one May 13, 2024
751aab4
Merge remote-tracking branch 'upstream/main' into str2int
tim-one May 13, 2024
81ef287
Fix tiny typos in comments.
tim-one May 13, 2024
d5818df
Fudge. Typo repair is adding new typos too :-(
tim-one May 13, 2024
70e7240
Merge remote-tracking branch 'upstream/main' into str2int
tim-one May 13, 2024
ca621e2
Consolidate all the seemingly random observations aboud .adjusted(),
tim-one May 14, 2024
c8b69b8
typo
tim-one May 14, 2024
5cff910
And we don't ever need exact reicrocals after all!
tim-one May 14, 2024
29ecb3f
Remove a mention of exact reciprocals, since they're no longer used.
tim-one May 14, 2024
a2bbe09
Merge remote-tracking branch 'upstream/main' into str2int
tim-one May 14, 2024
495fe8e
Merge remote-tracking branch 'upstream/main' into str2int
tim-one May 15, 2024
617b5e2
Update comments. No code changes.
tim-one May 15, 2024
28552fb
And exact reciprocals are back.
tim-one May 15, 2024
a523fd4
Merge remote-tracking branch 'upstream/main' into str2int
tim-one May 15, 2024
03225f3
Split on ceiling(w/2) instead of on the floor.
tim-one May 15, 2024
0b454ba
Delete a line of unused code.
tim-one May 15, 2024
1afe4df
Update Lib/_pylong.py
tim-one May 15, 2024
6999ec7
Merge branch 'str2int' of https://github.com/tim-one/cpython into str…
tim-one May 15, 2024
1f52d1d
typo
tim-one May 15, 2024
f1469cf
Bah. Another typo.
tim-one May 15, 2024
aa8381a
Merge branch 'main' into str2int
tim-one May 15, 2024
d24ae92
Repair thinko in comment.
tim-one May 15, 2024
e1549b6
Merge remote-tracking branch 'upstream/main' into str2int
tim-one May 15, 2024
3d74801
Splitting on ceiling(w/2) gave compute_powers() a harder job.
tim-one May 15, 2024
e435907
Merge branch 'main' into str2int
tim-one May 15, 2024
c6f6126
Reduce excessive indentation.
tim-one May 15, 2024
74bcebb
Merge remote-tracking branch 'upstream/main' into str2int
tim-one May 16, 2024
52205b4
Merge remote-tracking branch 'upstream/main' into str2int
tim-one May 16, 2024
61253a4
Give compute_powers() another IQ boost.
tim-one May 16, 2024
a2814b5
Add basic compoute_powers() test.
tim-one May 16, 2024
b16e639
At least whan I make a typo, auto-commplete reproduces it ;-)
tim-one May 16, 2024
bc440b8
Fix old typo in comment everyone missed ;-)
tim-one May 16, 2024
60797ab
Correct technical detail in comment.
tim-one May 16, 2024
1d4f3a0
And another stray typo :-(
tim-one May 16, 2024
c5ef2ce
Merge remote-tracking branch 'upstream/main' into str2int
tim-one May 16, 2024
0033cd5
Merge remote-tracking branch 'upstream/main' into str2int
tim-one May 17, 2024
057b5e9
Apply suggestions from code review
tim-one May 17, 2024
7570147
Merge branch 'str2int' of https://github.com/tim-one/cpython into str…
tim-one May 17, 2024
fc09650
Apply suggestions from code review
tim-one May 17, 2024
484dd0b
Merge branch 'str2int' of https://github.com/tim-one/cpython into str…
tim-one May 17, 2024
6c634c7
Add limit=0 to compute_powers() "by hand" testing.
tim-one May 18, 2024
a288adc
Merge remote-tracking branch 'upstream/main' into str2int
tim-one May 18, 2024
9ee61f6
Reduce cutoff for calling the new implementation from 3.5M to 2M.
tim-one May 18, 2024
8d5bc36
Update NEWS.
tim-one May 18, 2024
46cf316
Sheesh - put a wrong number in the new NEWS.
tim-one May 18, 2024
e39985e
Merge remote-tracking branch 'upstream/main' into str2int
tim-one May 18, 2024
f1cf315
Spell out the additional new reason not to do int(Decimal) directly.
tim-one May 18, 2024
138 changes: 137 additions & 1 deletion Lib/_pylong.py
@@ -210,6 +210,121 @@ def inner(a, b):
    w5pow = compute_powers(len(s), 5, DIGLIM)
    return inner(0, len(s))

# Asymptotically faster version, using the C decimal module. See
# comments at the end of the file. This uses decimal arithmetic to
# convert from base 10 to base 256. The latter is just a string of
# bytes, which CPython can convert very efficiently to a Python int.

# log of 10 to base 256 with best-possible 53-bit precision. Obtained
# via:
# from mpmath import mp
# mp.prec = 1000
# print(float(mp.log(10, 256)).hex())
_LOG_10_BASE_256 = float.fromhex('0x1.a934f0979a371p-2') # about 0.415

def _dec_str_to_int_inner(s):
    BYTELIM = 512
    D = decimal.Decimal
    result = bytearray()

    def inner(n, w):
        #assert n < D256 ** w # required, but too expensive to check
        if w <= BYTELIM:
            # XXX Stefan Pochmann discovered that, for 1024-bit ints,
            # `int(Decimal)` took 2.5x longer than `int(str(Decimal))`.
            # So simplify this code to the former if/when that gets
            # repaired.
            result.extend(int(str(n)).to_bytes(w)) # big-endian default
            return
        w2 = w >> 1
        if 0:
            # This is maximally clear, but "too slow". `decimal`
            # division is asymptotically fast, but we have no way to
            # tell it to reuse the high-precision reciprocal it computes
            # for pow256[w2], so it has to recompute it over & over &
            # over again :-(
            hi, lo = divmod(n, pow256[w2][0])
        else:
            p256, recip = pow256[w2]
            # The integer part will have a number of digits about equal
            # to the difference between the log10s of `n` and `pow256`
            # (which, since these are integers, is roughly approximated
            # by `.adjusted()`). That's the working precision we need,
            # but add some guard digits to protect against the "about"
            # and "roughly" uncertainties.
            ctx.prec = max(n.adjusted() - p256.adjusted(), 0) + 8
            hi = +n * +recip # unary `+` chops back to ctx.prec digits
            ctx.prec = decimal.MAX_PREC
            hi = hi.to_integral_value() # lose the fractional digits
            lo = n - hi * p256
            # Because we've been uniformly rounding down, `hi` is a
            # lower bound on the correct quotient.
            assert lo >= 0
            # Adjust quotient up if needed. It usually isn't. In random
            # testing, the loop body was entered in about one in 100
            # thousand cases. I never saw it need more than one iteration.
            count = 0
            while lo >= p256:
                count += 1
                # If the assert fails, chances are decent we're sooooo
                # far off it may seem to run forever otherwise - the
                # error analysis was fatally flawed in this case.
                assert count < 10, (count, w, len(s),
                                    n.adjusted(), p256.adjusted())
                lo -= p256
                hi += 1
            # The assert should always succeed, but it's way too slow to
            # keep enabled.
            #assert (hi, lo) == divmod(n, pow256[w2][0])
        inner(hi, w - w2)
        inner(lo, w2)

    # How many base 256 digits are needed? Mathematically, exactly
    # floor(log256(int(s))) + 1. There is no cheap way to compute this.
    # But we can get an upper bound, and that's necessary for our error
    # analysis to make sense. int(s) < 10**len(s), so the log needed is
    # < log256(10**len(s)) = len(s) * log256(10). However, using
    # finite-precision floating point for this, it's possible that the
    # computed value is a little less than the true value. If the true
    # value is at - or a little higher than - an integer, we can get an
    # off-by-1 error too low. So we add 2 instead of 1 if chopping lost
    # a fraction > 0.9.
    log_ub = len(s) * _LOG_10_BASE_256
    log_ub_as_int = int(log_ub)
    w = log_ub_as_int + 1 + (log_ub - log_ub_as_int > 0.9)
    # And what if we've plain exhausted the limits of HW floats? We
    # could compute the log to any desired precision using `decimal`,
    # but it's not plausible that anyone will pass a string requiring
    # trillions of bytes (unless they're just trying to "break things").
    if w.bit_length() >= 46:
        # "Only" had < 53 - 46 = 7 bits to spare in IEEE-754 double.
        # XXX I can't test this - don't have 169 terabytes of RAM to
        # build a string long enough to trigger this.
        raise ValueError(f"cannot convert string of len {len(s)} to int")
    with decimal.localcontext(_unbounded_dec_context) as ctx:
        D256 = D(256)
        pow256 = compute_powers(w, D256, BYTELIM)
        rpow256 = compute_powers(w, 1 / D256, BYTELIM)
        # We're going to do inexact, chopped arithmetic, multiplying by
        # an approximation to the reciprocal of 256**i. We chop to get a
        # lower bound on the true integer quotient. Our approximation is
        # a lower bound, the multiply is chopped too, and
        # to_integral_value() is also chopped.
        ctx.traps[decimal.Inexact] = 0
        ctx.rounding = decimal.ROUND_DOWN
        for k, v in pow256.items():
            # No need to save more precision in the reciprocal than the
            # power of 256 has, plus some guard digits to absorb most
            # relevant rounding errors. This is highly significant:
            # 1/2**i has the same number of significant decimal digits
            # as 5**i, generally over twice the number in 2**i.
            ctx.prec = v.adjusted() + 8
            # The unary "+" chops the reciprocal back to that precision.
            pow256[k] = v, +rpow256[k]
        del rpow256 # exact reciprocals no longer needed
        ctx.prec = decimal.MAX_PREC
        inner(D(s), w)
    return int.from_bytes(result)
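
The core trick in `inner()` is to replace a `decimal` division by a chopped multiplication with a reciprocal, then repair the quotient afterwards. The following is a minimal standalone sketch of that idea, not the PR's code: the name `divmod_via_reciprocal` and its `guard` parameter are hypothetical, and it recomputes the reciprocal by division on every call instead of caching chopped reciprocals of powers of 256 as the diff does.

```python
import decimal

def divmod_via_reciprocal(n, p, guard=8):
    """Hypothetical sketch: divmod(n, p) for non-negative integral Decimals,
    using a chopped multiply by an approximate reciprocal plus a fix-up."""
    with decimal.localcontext() as ctx:
        ctx.rounding = decimal.ROUND_DOWN   # every rounding chops toward 0
        ctx.traps[decimal.Inexact] = 0
        # Working precision: about as many digits as the quotient has,
        # plus guard digits (mirrors the `+ 8` in the diff).
        work = max(n.adjusted() - p.adjusted(), 0) + guard
        ctx.prec = work
        recip = 1 / p                       # lower bound on the true 1/p
        hi = +n * recip                     # unary `+` chops n to `work` digits
        # Enough precision for the remaining integer arithmetic to be exact.
        ctx.prec = max(n.adjusted(), p.adjusted()) + 3
        hi = hi.to_integral_value()         # chop the fractional digits
        lo = n - hi * p
        # `hi` can only be too small, so `lo` is never negative here.
        while lo >= p:
            lo -= p
            hi += 1
    return hi, lo

if __name__ == "__main__":
    D = decimal.Decimal
    n_int, p_int = 7**300, 256**40
    hi, lo = divmod_via_reciprocal(D(n_int), D(p_int))
    print((int(hi), int(lo)) == divmod(n_int, p_int))
```

Because every step rounds down, the estimate never overshoots, which is why a single upward correction loop is enough. An exact multiple of the divisor is a convenient way to see the loop fire, since the chopped product lands just below the true quotient.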

def int_from_string(s):
    """Asymptotically fast version of PyLong_FromString(), conversion
@@ -219,7 +334,10 @@ def int_from_string(s):
     # and underscores, and stripped leading whitespace. The input can still
     # contain underscores and have trailing whitespace.
     s = s.rstrip().replace('_', '')
-    return _str_to_int_inner(s)
+    func = _str_to_int_inner
+    if len(s) >= 3_500_000 and _decimal is not None:
+        func = _dec_str_to_int_inner
+    return func(s)
 
 def str_to_int(s):
     """Asymptotically fast version of decimal string to 'int' conversion."""
@@ -361,3 +479,21 @@ def int_divmod(a, b):
        return ~q, b + ~r
    else:
        return _divmod_pos(a, b)


# Notes on _dec_str_to_int_inner:
#
# Stefan Pochmann worked up a str->int function that used the decimal
# module to, in effect, convert from base 10 to base 256. This is
# "unnatural", in that it requires multiplying and dividing by large
# powers of 2, which `decimal` isn't naturally suited to. But
# `decimal`'s `*` and `/` are asymptotically superior to CPython's, so
# at _some_ point it could be expected to win.
#
# Alas, the crossover point was too high to be of much real interest. I
# (Tim) then worked on ways to replace its division with multiplication
# by a cached reciprocal approximation instead, fixing up errors
# afterwards. This reduced the crossover point significantly.
#
# I revisited the code, and found ways to improve and simplify it. The
# crossover point is at about 3.4 million digits now.
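
To make the "convert to base 256" idea in these notes concrete, here is a small sketch of the same recursive splitting using plain Python ints instead of `decimal`. It shows only the shape of the recursion (split on a power of 256, emit bytes at the leaves) and has none of the PR's performance properties; `int_to_bytes_recursive` and its `limit` parameter are illustrative names, not anything from `_pylong.py`.

```python
def int_to_bytes_recursive(n, w, limit=4):
    """Hypothetical sketch: write n (0 <= n < 256**w) as exactly w
    big-endian bytes by recursively splitting on powers of 256."""
    out = bytearray()

    def inner(n, w):
        if w <= limit:
            # Small enough: convert this piece directly.
            out.extend(n.to_bytes(w, "big"))
            return
        w2 = w >> 1
        hi, lo = divmod(n, 256 ** w2)
        inner(hi, w - w2)   # high-order bytes first (big-endian)
        inner(lo, w2)

    inner(n, w)
    return bytes(out)

if __name__ == "__main__":
    n = 10**50 + 12345
    w = (n.bit_length() + 7) // 8 + 1   # deliberately one byte too many
    data = int_to_bytes_recursive(n, w)
    print(len(data) == w, int.from_bytes(data, "big") == n)
```

Overestimating `w` only adds leading zero bytes, which is why the PR can work from an upper bound on the width rather than the exact value.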
10 changes: 10 additions & 0 deletions Lib/test/test_int.py
@@ -919,5 +919,15 @@ def test_pylong_roundtrip(self):
            self.assertEqual(n, int(sn))
            bits <<= 1

    @support.requires_resource('cpu')
    def test_pylong_roundtrip_huge(self):
        # k blocks of 1234567890
        k = 1_000_000 # so 10 million digits in all
        tentoten = 10**10
        n = 1234567890 * ((tentoten**k - 1) // (tentoten - 1))
        sn = "1234567890" * k
        self.assertEqual(n, int(sn))
        self.assertEqual(sn, str(n))

if __name__ == "__main__":
    unittest.main()
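
The new test builds its expected integer arithmetically rather than by parsing, using the geometric-series identity 1 + B + B**2 + ... + B**(k-1) = (B**k - 1) // (B - 1) with B = 10**10. A small-scale sketch of the same construction follows; `repeated_block_value` is a made-up helper name, and the tiny `k` is chosen only so it runs instantly.

```python
def repeated_block_value(block, k):
    """Hypothetical helper: the integer whose decimal form is `block`
    repeated k times, built without parsing the long string."""
    base = 10 ** len(block)
    # (base**k - 1) // (base - 1) is 1 + base + ... + base**(k - 1).
    return int(block) * ((base**k - 1) // (base - 1))

if __name__ == "__main__":
    k = 3
    n = repeated_block_value("1234567890", k)
    print(n == int("1234567890" * k), str(n) == "1234567890" * k)
```

Building `n` this way means the test does not depend on the conversion it is checking.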
@@ -0,0 +1 @@
If the C version of the ``decimal`` module is available, ``int(str)`` now uses it to supply an asymptotically much faster conversion. However, this only applies if the string contains over about 3.5 million digits.
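
A usage note for trying this out: on interpreters that enforce the integer string conversion length limit, `int()` on a multi-million-digit string raises `ValueError` unless the limit is raised or disabled first. The sketch below is one way to do that temporarily. It assumes `sys.set_int_max_str_digits` may or may not be present, and `parse_huge_decimal` is a hypothetical helper, not something this PR adds.

```python
import sys

try:
    import _decimal  # the C accelerator the new code path depends on
except ImportError:
    _decimal = None

def parse_huge_decimal(s):
    """Hypothetical helper: int(s) with the digit limit lifted temporarily."""
    setter = getattr(sys, "set_int_max_str_digits", None)
    if setter is None:
        return int(s)           # interpreter has no limit to lift
    old = sys.get_int_max_str_digits()
    setter(0)                   # 0 disables the limit
    try:
        return int(s)
    finally:
        setter(old)             # restore the previous setting

if __name__ == "__main__":
    print("C decimal available:", _decimal is not None)
    n = parse_huge_decimal("1234567890" * 1000)
    print(n % 10**10)
```

Whether the `decimal`-based path is actually taken still depends on the string length threshold described in the entry above.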