gh-91432: Specialize FOR_ITER #91713
Conversation
Microbenchmarks so far:

```python
from pyperf import Runner, perf_counter
from itertools import repeat

def for_range(loops, length):
    repetitions = repeat(None, loops)
    R = range(length)
    t0 = perf_counter()
    for _ in repetitions:
        for x in R:
            pass
    t1 = perf_counter()
    return t1 - t0

def for_list(loops, length):
    repetitions = repeat(None, loops)
    L = list(map(float, range(length)))
    t0 = perf_counter()
    for _ in repetitions:
        for x in L:
            pass
    t1 = perf_counter()
    return t1 - t0

bench = Runner().bench_time_func
for n in [20, 200, 2_000, 20_000, 200_000]:
    bench(f"for_range {n:_}", for_range, n, inner_loops=n)
    bench(f"for_list {n:_}", for_list, n, inner_loops=n)
```
The one unfortunate thing about this implementation so far is that it restricts the fast range iterator (which currently uses arbitrary C longs) to only values that will fit in a 1-digit int.
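A CPython int is stored as a sign plus an array of 30-bit "digits" (on typical 64-bit builds), so "fits in a 1-digit int" means the magnitude stays below 2**30. A small Python sketch of that bound check; the `fits_in_one_digit` helper and the hard-coded shift are illustrative assumptions, not code from the PR:

```python
PyLong_SHIFT = 30                    # digit width on most modern builds
ONE_DIGIT_MAX = 2**PyLong_SHIFT - 1  # 1_073_741_823

def fits_in_one_digit(value: int) -> bool:
    # Hypothetical guard: True if `value` would occupy a single
    # CPython digit (the sign is stored separately, in ob_size).
    return -ONE_DIGIT_MAX <= value <= ONE_DIGIT_MAX

print(fits_in_one_digit(1_000_000_000))  # True
print(fits_in_one_digit(2**30))          # False
```

So ranges over values up to about a billion stay on the fast path, but anything larger deopts.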
(force-pushed from 5d5d807 to bd7575e)
Pyperformance: Geometric mean: 1.01x faster.
I'm not seeing as much of a speedup, but I think this is worthwhile even if it produces no speedup, as it enables specialization for other types, specifically generators.
markshannon left a comment:
This adds quite a lot of complexity to the specialization for range, handling the following store as well.
What sort of performance do you get just specializing the FOR_ITER and not the following STORE_FAST?
Lib/test/test_range.py (outdated):

```diff
-        # check first 100 entries
-        self.assert_iterators_equal(iter1, iter2, test_id, limit=100)
+        # check first 10 entries
+        self.assert_iterators_equal(iter1, iter2, test_id, limit=10)
```
Why reduce from 100 to 10?
Doubling the number of M values otherwise made it 4x slower, and it started to feel a bit sluggish, so I wanted to offset that.
Objects/rangeobject.c (outdated):

```diff
-    it->start = start;
-    it->step = step;
-    it->len = len;
+    it->start = Py_SAFE_DOWNCAST(start, long, sdigit);
```
Would it be cleaner to change the signature of fast_range_iter to take sdigits, rather than longs?
I'm not so sure -- downcasts need to happen somewhere, and fast_range_iter is called in two places. The PyLong_AsLong calls could theoretically be replaced with Py_SIZE checks and ob_digit[0] accesses, but I think the existing code is fine using the public API.
I measured no speedup at all, even on microbenchmarks, when the range opcode was just something like:

```c
if (r->index < r->len) {
    PyObject *res = PyLong_FromLong(
        (long)(r->start + (unsigned long)(r->index++) * r->step));
    if (res == NULL) {
        goto error;
    }
    PUSH(res);
    PREDICT(STORE_FAST);
    NOTRACE_DISPATCH();
}
```
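The opcode above computes each iteration's value in closed form as `start + index * step`. That identity is easy to sanity-check from Python (a quick illustration, not part of the PR):

```python
def range_value(r: range, index: int) -> int:
    # Closed form used by the specialized opcode: start + index * step.
    return r.start + index * r.step

# Matches actual iteration for both positive and negative steps:
for r in (range(3, 50, 7), range(10, 0, -2)):
    assert [range_value(r, i) for i in range(len(r))] == list(r)
```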
Python/ceval.c (outdated):

```c
}
JUMPBY(INLINE_CACHE_ENTRIES_FOR_ITER + 1);
sdigit value = r->start + (digit)(r->index++) * r->step;
if (value < _PY_NSMALLPOSINTS && value >= -_PY_NSMALLNEGINTS) {
```
I think there are too many branches here.
Ideally a specialized instruction should have a few DEOPT_IF tests, then the rest of the code should be (mostly) branchless.
This also includes far too much knowledge of the internals of the int object, IMO.
Would performance be measurably worse if start, end and stop were Py_ssize_t and this code were simplified to PyLong_FromSsize_t?
This feels like an elaborate workaround the oddities of our int implementation.
We should be fixing that, not working around it.
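For context, the `_PY_NSMALLPOSINTS` / `_PY_NSMALLNEGINTS` branch exists because CPython pre-allocates the ints in [-5, 256] and hands out the cached object instead of allocating. The cache is observable from Python, though it is a CPython implementation detail rather than a language guarantee:

```python
# Values in the small-int cache are shared objects on CPython:
a = 256
b = 255 + 1          # folded to 256 at compile time
print(a is b)        # True: both are the cached 256 object

# Outside the cache, runtime-created ints are distinct objects.
# int("257") avoids compile-time constant sharing of a 257 literal.
c = 257
d = int("257")
print(c is d)        # False on CPython
```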
I think there may be a middle ground here: add a function like this to longobject.c:

```c
int
_PyLong_AssignValueLong(PyLongObject **location, long value)
{
    /* pseudocode:
       if value in small:
           set to small, decref old long
       elif refcnt(*location) == 1 and the right size:
           mutate in place
       else:
           Py_SETREF(*location, PyLong_FromLong(value));
    */
}
```
Sort of similar in spirit to PyUnicode_Append()
That way, the PyLongObject implementation details stay out of the eval loop, but we still get the benefit of avoiding the allocation, which I believe is the main source of the speedup.
This could theoretically be re-used for enumerate() as well.
And even if the implementation of integers changes (e.g. new freelists, no more small ints), the operation still makes sense.
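The three-way decision being proposed (cached small int, in-place reuse when nothing else owns the old object, otherwise a fresh allocation) can be sketched at the Python level. This is purely an illustration of the branching: the helper name is hypothetical, Python ints are immutable so the "reuse" branch only assigns, and `sys.getrefcount` stands in for `Py_REFCNT`:

```python
import sys

def assign_value_sketch(location: list, value: int) -> str:
    # `location` is a one-element list standing in for PyLongObject **.
    old = location[0]
    if -5 <= value <= 256:
        location[0] = value   # reuse the cached small int, drop `old`
        return "small"
    # getrefcount sees the list slot, the local `old`, and its own
    # argument, so <= 3 roughly approximates the C check Py_REFCNT == 1.
    if sys.getrefcount(old) <= 3:
        location[0] = value   # C version would overwrite old->ob_digit
        return "reuse"
    location[0] = value       # Py_SETREF(*location, PyLong_FromLong(value))
    return "alloc"

box = [10**6]
print(assign_value_sketch(box, 42), box[0])  # small 42
```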
Ultimately we want to have virtual iterators, and specialization of FOR_ITER. If I were doing the whole thing myself, from scratch, I would do the following:
Since we already have this PR, it probably makes sense to do 2 first, but I think we need to simplify it.
(force-pushed from 14fbedb to d55868f)
Pyperformance shows a 1% speedup, with 35 benchmarks faster and 15 slower. Benchmarks hidden because not significant (12): tornado_http, sqlalchemy_imperative, sqlalchemy_declarative, sympy_sum, json_loads, sympy_expand, nbody, pathlib, xml_etree_process, genshi_xml, xml_etree_parse, richards.
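The headline number pyperformance reports is the geometric mean of the per-benchmark speedup ratios, which is how 35 wins and 15 losses can net out to "1.01x faster". A quick sketch of that computation; the ratios below are made up for illustration, not the PR's data:

```python
import math

ratios = [1.05, 0.98, 1.02, 1.00]  # hypothetical per-benchmark speedups
geo_mean = math.prod(ratios) ** (1 / len(ratios))
print(f"{geo_mean:.3f}x")  # ~1.012x for these ratios
```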
markshannon left a comment:
A couple of minor issues, otherwise looks good.
Objects/longobject.c (outdated):

```c
int
_PyLong_AssignValue(PyObject **target, long value)
```
Could you make this Py_ssize_t rather than long?
On 64-bit Windows, sizeof(long) != sizeof(void *), which makes reasoning about code using long harder.
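The data-model difference is easy to observe: on LLP64 platforms (64-bit Windows) `long` stays 4 bytes while pointers are 8, whereas on LP64 platforms (64-bit Linux/macOS) both are 8; `Py_ssize_t` always matches the pointer size, which is why it is preferred here. A quick illustration via `ctypes`:

```python
import ctypes

long_size = ctypes.sizeof(ctypes.c_long)      # 4 on 64-bit Windows, 8 on LP64
ptr_size = ctypes.sizeof(ctypes.c_void_p)     # 8 on any 64-bit build
ssize_size = ctypes.sizeof(ctypes.c_ssize_t)  # always equals the pointer size
print(long_size, ptr_size, ssize_size)
```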
Should we change PyRangeIterator to use Py_ssize_t as well?
Ultimately yes. But maybe not in this PR, since that isn't new code.
```c
if (_PyLong_AssignValue(&GETLOCAL(_Py_OPARG(next)), value) < 0) {
    goto error;
}
JUMPBY(INLINE_CACHE_ENTRIES_FOR_ITER + 1);
```
Could you add a comment explaining why this is INLINE_CACHE_ENTRIES_FOR_ITER + 1 rather than just INLINE_CACHE_ENTRIES_FOR_ITER?
```c
    Py_XDECREF(old);
    return 0;
}
else if (old != NULL && PyLong_CheckExact(old) &&
```
This only works for positive integers.
Given that it is only used for range iteration, that's fine for now.
But could you add a comment, so it doesn't get overlooked when we implement faster-cpython/ideas#147
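As the note says, handling only positive values is fine for typical range loops, but range iteration can of course produce negative values that would have to take the general path. A trivial illustration:

```python
# A range whose values go negative mid-iteration:
values = list(range(2, -7, -3))
print(values)  # [2, -1, -4]
```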
Thanks. Looks good.
I'll merge once the build bots are happy.
🤖 New build scheduled with the buildbot fleet by @markshannon for commit db21da1 🤖 If you want to schedule another build, you need to add the ":hammer: test-with-buildbots" label again.
Faster-Cpython thread: faster-cpython/ideas#67
GH issue: #91432