
Conversation

sweeneyde (Member) commented Apr 19, 2022:

Faster-Cpython thread: faster-cpython/ideas#67
GH issue: #91432

sweeneyde (Member Author) commented Apr 19, 2022:

Microbenchmarks so far:

Details

benchmark:

from pyperf import Runner, perf_counter
from itertools import repeat

def for_range(loops, length):
    repetitions = repeat(None, loops)
    R = range(length)

    t0 = perf_counter()
    for _ in repetitions:
        for x in R:
            pass
    t1 = perf_counter()

    return t1 - t0

def for_list(loops, length):
    repetitions = repeat(None, loops)
    L = list(map(float, range(length)))

    t0 = perf_counter()
    for _ in repetitions:
        for x in L:
            pass
    t1 = perf_counter()

    return t1 - t0

bench = Runner().bench_time_func
for n in [20, 200, 2_000, 20_000, 200_000]:
    bench(f"for_range {n:_}", for_range, n, inner_loops=n)
    bench(f"for_list {n:_}", for_list, n, inner_loops=n)

Result:

Faster (10):
- for_range 200_000: 11.6 ns +- 0.4 ns -> 3.90 ns +- 0.02 ns: 2.97x faster
- for_range 20_000: 11.3 ns +- 0.1 ns -> 3.91 ns +- 0.03 ns: 2.90x faster
- for_range 2_000: 10.7 ns +- 0.2 ns -> 3.97 ns +- 0.03 ns: 2.69x faster
- for_range 200: 6.41 ns +- 0.16 ns -> 4.27 ns +- 0.06 ns: 1.50x faster
- for_range 20: 8.82 ns +- 0.48 ns -> 7.12 ns +- 0.42 ns: 1.24x faster
- for_list 20: 6.99 ns +- 0.50 ns -> 6.11 ns +- 0.52 ns: 1.14x faster
- for_list 20_000: 5.12 ns +- 0.13 ns -> 4.55 ns +- 0.05 ns: 1.12x faster
- for_list 200: 5.19 ns +- 0.12 ns -> 4.62 ns +- 0.11 ns: 1.12x faster
- for_list 2_000: 5.02 ns +- 0.05 ns -> 4.53 ns +- 0.09 ns: 1.11x faster
- for_list 200_000: 5.57 ns +- 0.08 ns -> 5.06 ns +- 0.06 ns: 1.10x faster

Geometric mean: 1.54x faster

sweeneyde (Member Author) commented Apr 19, 2022:

The one unfortunate thing about this implementation so far is that it restricts the fast range iterator (which currently uses arbitrary C longs) to values that fit in a 1-digit int, i.e. abs(x) < 2**30 or abs(x) < 2**15, depending on the platform.
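For reference, the digit width is exposed to Python code via `sys.int_info`, so the one-digit restriction can be sketched as follows (an illustration of the limit, not code from this PR):

```python
import sys

def fits_in_one_digit(x):
    # CPython stores ints as arrays of 30-bit (or 15-bit) digits;
    # sys.int_info.bits_per_digit reports which width this build uses.
    return abs(x) < 2 ** sys.int_info.bits_per_digit

print(sys.int_info.bits_per_digit)  # 30 on typical builds, 15 on some platforms
print(fits_in_one_digit(2**31))     # False on either digit width
```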

@sweeneyde added the performance, interpreter-core, and 3.11 labels on Apr 19, 2022
sweeneyde (Member Author) commented Apr 20, 2022:

Pyperformance: Geometric mean: 1.01x faster:

Details

Results:

Slower (9):
- pyflate: 414 ms +- 8 ms -> 424 ms +- 5 ms: 1.02x slower
- pidigits: 179 ms +- 1 ms -> 183 ms +- 2 ms: 1.02x slower
- dulwich_log: 75.4 ms +- 1.2 ms -> 76.9 ms +- 1.8 ms: 1.02x slower
- chameleon: 6.20 ms +- 0.25 ms -> 6.32 ms +- 0.24 ms: 1.02x slower
- pickle_list: 4.43 us +- 0.13 us -> 4.51 us +- 0.09 us: 1.02x slower
- unpickle_list: 4.77 us +- 0.07 us -> 4.86 us +- 0.08 us: 1.02x slower
- pathlib: 19.0 ms +- 0.5 ms -> 19.3 ms +- 0.3 ms: 1.01x slower
- scimark_lu: 98.1 ms +- 1.2 ms -> 99.3 ms +- 1.7 ms: 1.01x slower
- go: 139 ms +- 2 ms -> 140 ms +- 3 ms: 1.01x slower

Faster (26):
- scimark_sparse_mat_mult: 5.02 ms +- 0.18 ms -> 4.46 ms +- 0.30 ms: 1.12x faster
- nbody: 89.9 ms +- 2.1 ms -> 83.5 ms +- 5.4 ms: 1.08x faster
- mako: 10.1 ms +- 0.9 ms -> 9.42 ms +- 0.11 ms: 1.08x faster
- pickle: 12.4 us +- 0.7 us -> 11.6 us +- 1.1 us: 1.07x faster
- float: 75.7 ms +- 1.5 ms -> 71.5 ms +- 1.1 ms: 1.06x faster
- unpack_sequence: 44.6 ns +- 5.0 ns -> 42.5 ns +- 0.8 ns: 1.05x faster
- meteor_contest: 109 ms +- 3 ms -> 105 ms +- 2 ms: 1.04x faster
- regex_v8: 22.1 ms +- 0.4 ms -> 21.3 ms +- 0.3 ms: 1.04x faster
- crypto_pyaes: 85.4 ms +- 2.4 ms -> 82.3 ms +- 1.9 ms: 1.04x faster
- telco: 6.36 ms +- 0.15 ms -> 6.13 ms +- 0.16 ms: 1.04x faster
- chaos: 72.9 ms +- 1.7 ms -> 70.4 ms +- 1.9 ms: 1.04x faster
- unpickle: 16.0 us +- 1.1 us -> 15.5 us +- 1.2 us: 1.03x faster
- django_template: 37.7 ms +- 0.8 ms -> 36.6 ms +- 1.8 ms: 1.03x faster
- html5lib: 66.7 ms +- 2.6 ms -> 64.8 ms +- 2.6 ms: 1.03x faster
- regex_effbot: 2.87 ms +- 0.04 ms -> 2.80 ms +- 0.03 ms: 1.02x faster
- sqlalchemy_imperative: 23.7 ms +- 0.6 ms -> 23.2 ms +- 0.9 ms: 1.02x faster
- regex_dna: 197 ms +- 3 ms -> 193 ms +- 4 ms: 1.02x faster
- unpickle_pure_python: 231 us +- 7 us -> 227 us +- 5 us: 1.02x faster
- fannkuch: 384 ms +- 6 ms -> 376 ms +- 6 ms: 1.02x faster
- regex_compile: 138 ms +- 2 ms -> 136 ms +- 3 ms: 1.02x faster
- hexiom: 6.30 ms +- 0.12 ms -> 6.20 ms +- 0.14 ms: 1.02x faster
- sqlite_synth: 2.35 us +- 0.04 us -> 2.32 us +- 0.03 us: 1.01x faster
- logging_silent: 92.5 ns +- 2.0 ns -> 91.6 ns +- 1.7 ns: 1.01x faster
- sympy_integrate: 20.9 ms +- 0.3 ms -> 20.7 ms +- 0.3 ms: 1.01x faster
- xml_etree_process: 52.5 ms +- 1.0 ms -> 52.2 ms +- 0.8 ms: 1.01x faster
- scimark_sor: 107 ms +- 2 ms -> 106 ms +- 2 ms: 1.01x faster

Benchmark hidden because not significant (24): 2to3, deltablue, json_dumps, json_loads, logging_format, logging_simple, nqueens, pickle_dict, pickle_pure_python, python_startup, python_startup_no_site, raytrace, richards, scimark_fft, scimark_monte_carlo, spectral_norm, sqlalchemy_declarative, sympy_expand, sympy_sum, sympy_str, tornado_http, xml_etree_parse, xml_etree_iterparse, xml_etree_generate

Geometric mean: 1.01x faster

markshannon (Member):

I'm not seeing as much of a speedup, but I think this is worthwhile even if it produces no speedup, as it enables specialization for other types, specifically generators.

markshannon (Member) left a review:

This adds quite a lot of complexity to the specialization for range, handling the following store as well.

What sort of performance do you get just specializing the FOR_ITER and not the following STORE_FAST?

# check first 100 entries
self.assert_iterators_equal(iter1, iter2, test_id, limit=100)
# check first 10 entries
self.assert_iterators_equal(iter1, iter2, test_id, limit=10)
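For context, `limit` caps how many items of the two iterators are compared. A simplified stand-in for the test helper (a hypothetical sketch, not the actual CPython test code) could look like:

```python
from itertools import islice

def assert_iterators_equal(iter1, iter2, test_id, limit=None):
    # Compare two iterators element by element, stopping after
    # `limit` items when a limit is given (simplified sketch).
    if limit is not None:
        iter1 = islice(iter1, limit)
        iter2 = islice(iter2, limit)
    for i, (a, b) in enumerate(zip(iter1, iter2)):
        if a != b:
            raise AssertionError(f"{test_id}: item {i}: {a!r} != {b!r}")

# Only the first 10 items are checked, so a mismatch at index 50 passes:
assert_iterators_equal(iter(range(100)), iter(list(range(50)) + [0] * 50),
                       "demo", limit=10)
```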
markshannon (Member):

Why reduce from 100 to 10?

sweeneyde (Member Author):

Doubling the number of M values made the test 4x slower, and it started to feel a bit sluggish, so I reduced the limit to offset that.

it->start = start;
it->step = step;
it->len = len;
it->start = Py_SAFE_DOWNCAST(start, long, sdigit);
markshannon (Member):

Would it be cleaner to change the signature of fast_range_iter to take sdigits, rather than longs?

sweeneyde (Member Author) commented Apr 24, 2022:

I'm not so sure -- downcasts need to happen somewhere, and fast_range_iter is called in two places. The PyLong_AsLong calls could theoretically be replaced with Py_SIZE checks and ob_digit[0] accesses, but I think the existing code is fine using the public API.

sweeneyde (Member Author):

> What sort of performance do you get just specializing the FOR_ITER and not the following STORE_FAST?

I measured no speedup at all, even on microbenchmarks, when the range opcode was just something like

if (r->index < r->len) {
    PyObject *res = PyLong_FromLong(
        (long)(r->start + (unsigned long)(r->index++) * r->step));
    if (res == NULL) {
        goto error;
    }
    PUSH(res);
    PREDICT(STORE_FAST);
    NOTRACE_DISPATCH();
}
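In Python terms, the specialized path computes each item arithmetically from the iterator's stored fields instead of going through the generic iterator protocol; roughly (an illustrative model, not the C code):

```python
def range_iter_values(r):
    # Model of rangeiterobject: the C struct keeps start, step, len,
    # and a running index; each item is computed as start + index*step.
    index = 0
    while index < len(r):
        yield r.start + index * r.step
        index += 1

assert list(range_iter_values(range(3, 30, 4))) == list(range(3, 30, 4))
```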

@sweeneyde changed the title from "Specialize FOR_ITER" to "gh-91432: Specialize FOR_ITER" on Apr 24, 2022
Python/ceval.c (outdated):
}
JUMPBY(INLINE_CACHE_ENTRIES_FOR_ITER + 1);
sdigit value = r->start + (digit)(r->index++) * r->step;
if (value < _PY_NSMALLPOSINTS && value >= -_PY_NSMALLNEGINTS) {
markshannon (Member):

I think there are too many branches here.
Ideally a specialized instruction should have a few DEOPT_IF tests, then the rest of the code should be (mostly) branchless.

This also includes far too much knowledge of the internals of the int object, IMO.

Would performance be measurably worse if start, stop and step were Py_ssize_t and this code were simplified to PyLong_FromSsize_t? This feels like an elaborate workaround for the oddities of our int implementation. We should be fixing those, not working around them.

sweeneyde (Member Author):

I think there may be a middle ground here: add a function like this to longobject.c:

int
_PyLong_AssignValueLong(PyLongObject **location, long value)
{
    if (/* value is a cached small int */) {
        /* Py_SETREF *location to the cached object */
    }
    else if (/* *location has refcount 1 and the right size */) {
        /* mutate *location in place */
    }
    else {
        Py_SETREF(*location, (PyLongObject *)PyLong_FromLong(value));
    }
    return 0;  /* -1 if allocation failed */
}

Sort of similar in spirit to PyUnicode_Append()

That way, the PyLongObject implementation details stay out of the eval loop, but we still get the benefit of avoiding the allocation, which I believe is the main source of the speedup.
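The "value in small" branch relies on CPython caching the ints -5 through 256 and handing out the same object for them every time (a CPython implementation detail, not a language guarantee), which can be observed from Python:

```python
# Construct the values at runtime so the compiler can't fold or
# deduplicate constants.
x = int("256")
y = int("256")
assert x is y       # both resolve to the cached small-int object (CPython)

p = int("1024")
q = int("1024")
assert p is not q   # outside the cache, each parse allocates a fresh object
```

Swapping in the cached object (first branch) or mutating a refcount-1 object in place (second branch) skips that fresh allocation, which is the main cost being avoided here.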

sweeneyde (Member Author):

This could theoretically be re-used for enumerate() as well.

And even if the implementation of integers changes (e.g. new freelists, no more small ints), the operation still makes sense.

markshannon (Member):

Ultimately we want to have virtual iterators, and specialization of FOR_ITER for lists, ranges and generators (and maybe others?).

If I were doing the whole thing myself, from scratch, I would do the following:

  1. Virtual iterators
  2. Specialize for lists and ranges.
  3. Specialize for generators.

Since we already have this PR, it probably makes sense to do 2 first, but I think we need to simplify FOR_ITER_RANGE first.

@sweeneyde force-pushed the special_for_iter2 branch from 14fbedb to d55868f on June 19, 2022
markshannon (Member):

Pyperformance shows a 1% speedup, with 35 benchmarks faster and 15 slower.

Details

+-------------------------+------------------------------------+-------------------------------------------------+
| Benchmark | 2022-06-18_20-47-main-6066f450b91f | 2022-06-19_10-01-special_for_iter2-d55868f9a354 |
+=========================+====================================+=================================================+
| scimark_sparse_mat_mult | 7.32 ms | 6.40 ms: 1.14x faster |
+-------------------------+------------------------------------+-------------------------------------------------+
| telco | 11.3 ms | 10.7 ms: 1.06x faster |
+-------------------------+------------------------------------+-------------------------------------------------+
| html5lib | 106 ms | 99.9 ms: 1.06x faster |
+-------------------------+------------------------------------+-------------------------------------------------+
| scimark_fft | 550 ms | 526 ms: 1.05x faster |
+-------------------------+------------------------------------+-------------------------------------------------+
| regex_compile | 217 ms | 209 ms: 1.04x faster |
+-------------------------+------------------------------------+-------------------------------------------------+
| logging_silent | 160 ns | 156 ns: 1.03x faster |
+-------------------------+------------------------------------+-------------------------------------------------+
| scimark_monte_carlo | 111 ms | 108 ms: 1.03x faster |
+-------------------------+------------------------------------+-------------------------------------------------+
| mako | 17.3 ms | 16.9 ms: 1.03x faster |
+-------------------------+------------------------------------+-------------------------------------------------+
| hexiom | 10.3 ms | 10.1 ms: 1.02x faster |
+-------------------------+------------------------------------+-------------------------------------------------+
| regex_v8 | 36.1 ms | 35.5 ms: 1.02x faster |
+-------------------------+------------------------------------+-------------------------------------------------+
| regex_effbot | 5.92 ms | 5.82 ms: 1.02x faster |
+-------------------------+------------------------------------+-------------------------------------------------+
| pickle_pure_python | 476 us | 470 us: 1.01x faster |
+-------------------------+------------------------------------+-------------------------------------------------+
| pickle | 17.6 us | 17.3 us: 1.01x faster |
+-------------------------+------------------------------------+-------------------------------------------------+
| crypto_pyaes | 123 ms | 122 ms: 1.01x faster |
+-------------------------+------------------------------------+-------------------------------------------------+
| json_dumps | 21.1 ms | 20.9 ms: 1.01x faster |
+-------------------------+------------------------------------+-------------------------------------------------+
| regex_dna | 355 ms | 351 ms: 1.01x faster |
+-------------------------+------------------------------------+-------------------------------------------------+
| scimark_lu | 183 ms | 181 ms: 1.01x faster |
+-------------------------+------------------------------------+-------------------------------------------------+
| raytrace | 475 ms | 470 ms: 1.01x faster |
+-------------------------+------------------------------------+-------------------------------------------------+
| spectral_norm | 164 ms | 162 ms: 1.01x faster |
+-------------------------+------------------------------------+-------------------------------------------------+
| unpickle_pure_python | 343 us | 340 us: 1.01x faster |
+-------------------------+------------------------------------+-------------------------------------------------+
| logging_format | 11.0 us | 10.9 us: 1.01x faster |
+-------------------------+------------------------------------+-------------------------------------------------+
| nqueens | 142 ms | 141 ms: 1.01x faster |
+-------------------------+------------------------------------+-------------------------------------------------+
| sympy_integrate | 33.6 ms | 33.4 ms: 1.01x faster |
+-------------------------+------------------------------------+-------------------------------------------------+
| xml_etree_iterparse | 165 ms | 164 ms: 1.01x faster |
+-------------------------+------------------------------------+-------------------------------------------------+
| pyflate | 674 ms | 669 ms: 1.01x faster |
+-------------------------+------------------------------------+-------------------------------------------------+
| xml_etree_generate | 131 ms | 130 ms: 1.01x faster |
+-------------------------+------------------------------------+-------------------------------------------------+
| float | 123 ms | 122 ms: 1.01x faster |
+-------------------------+------------------------------------+-------------------------------------------------+
| scimark_sor | 192 ms | 191 ms: 1.01x faster |
+-------------------------+------------------------------------+-------------------------------------------------+
| deltablue | 5.52 ms | 5.48 ms: 1.01x faster |
+-------------------------+------------------------------------+-------------------------------------------------+
| meteor_contest | 173 ms | 172 ms: 1.01x faster |
+-------------------------+------------------------------------+-------------------------------------------------+
| logging_simple | 10.1 us | 10.1 us: 1.01x faster |
+-------------------------+------------------------------------+-------------------------------------------------+
| sympy_str | 462 ms | 459 ms: 1.01x faster |
+-------------------------+------------------------------------+-------------------------------------------------+
| python_startup | 14.0 ms | 14.0 ms: 1.00x faster |
+-------------------------+------------------------------------+-------------------------------------------------+
| python_startup_no_site | 10.3 ms | 10.3 ms: 1.00x faster |
+-------------------------+------------------------------------+-------------------------------------------------+
| pidigits | 324 ms | 325 ms: 1.00x slower |
+-------------------------+------------------------------------+-------------------------------------------------+
| 2to3 | 407 ms | 408 ms: 1.00x slower |
+-------------------------+------------------------------------+-------------------------------------------------+
| pickle_list | 6.90 us | 6.92 us: 1.00x slower |
+-------------------------+------------------------------------+-------------------------------------------------+
| django_template | 53.6 ms | 53.9 ms: 1.01x slower |
+-------------------------+------------------------------------+-------------------------------------------------+
| genshi_text | 35.8 ms | 36.0 ms: 1.01x slower |
+-------------------------+------------------------------------+-------------------------------------------------+
| sqlite_synth | 4.04 us | 4.07 us: 1.01x slower |
+-------------------------+------------------------------------+-------------------------------------------------+
| pickle_dict | 50.0 us | 50.4 us: 1.01x slower |
+-------------------------+------------------------------------+-------------------------------------------------+
| unpickle | 21.5 us | 21.7 us: 1.01x slower |
+-------------------------+------------------------------------+-------------------------------------------------+
| chameleon | 11.2 ms | 11.3 ms: 1.01x slower |
+-------------------------+------------------------------------+-------------------------------------------------+
| unpack_sequence | 75.5 ns | 76.3 ns: 1.01x slower |
+-------------------------+------------------------------------+-------------------------------------------------+
| unpickle_list | 7.92 us | 8.02 us: 1.01x slower |
+-------------------------+------------------------------------+-------------------------------------------------+
| dulwich_log | 105 ms | 106 ms: 1.01x slower |
+-------------------------+------------------------------------+-------------------------------------------------+
| fannkuch | 620 ms | 631 ms: 1.02x slower |
+-------------------------+------------------------------------+-------------------------------------------------+
| chaos | 112 ms | 114 ms: 1.02x slower |
+-------------------------+------------------------------------+-------------------------------------------------+
| go | 219 ms | 224 ms: 1.02x slower |
+-------------------------+------------------------------------+-------------------------------------------------+
| Geometric mean | (ref) | 1.01x faster |
+-------------------------+------------------------------------+-------------------------------------------------+

Benchmark hidden because not significant (12): tornado_http, sqlalchemy_imperative, sqlalchemy_declarative, sympy_sum, json_loads, sympy_expand, nbody, pathlib, xml_etree_process, genshi_xml, xml_etree_parse, richards

markshannon (Member) left a review:

A couple of minor issues, otherwise looks good.

}

int
_PyLong_AssignValue(PyObject **target, long value)
markshannon (Member) commented Jun 20, 2022:

Could you make this Py_ssize_t rather than long?
On 64-bit Windows, sizeof(long) != sizeof(void *), which makes reasoning about code that uses long harder.
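The size mismatch is easy to confirm from Python with ctypes: on LLP64 Windows, a C long is 4 bytes while pointers are 8, whereas Py_ssize_t is pointer-sized on every platform:

```python
import ctypes

# c_ssize_t corresponds to Py_ssize_t, which always matches the
# pointer size; C `long` is only 4 bytes on 64-bit Windows (LLP64).
assert ctypes.sizeof(ctypes.c_ssize_t) == ctypes.sizeof(ctypes.c_void_p)
print(ctypes.sizeof(ctypes.c_long), ctypes.sizeof(ctypes.c_void_p))
```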

sweeneyde (Member Author):

Should we change PyRangeIterator to use Py_ssize_t as well?

markshannon (Member):

Ultimately yes. But maybe not in this PR, since that isn't new code.

if (_PyLong_AssignValue(&GETLOCAL(_Py_OPARG(next)), value) < 0) {
goto error;
}
JUMPBY(INLINE_CACHE_ENTRIES_FOR_ITER + 1);
markshannon (Member):

Could you add a comment explaining why this is INLINE_CACHE_ENTRIES_FOR_ITER + 1 rather than just INLINE_CACHE_ENTRIES_FOR_ITER.

bedevere-bot:

When you're done making the requested changes, leave the comment: I have made the requested changes; please review again.

Py_XDECREF(old);
return 0;
}
else if (old != NULL && PyLong_CheckExact(old) &&
markshannon (Member):

This only works for positive integers.
Given that it is only used for range iteration, that's fine for now.
But could you add a comment, so it doesn't get overlooked when we implement faster-cpython/ideas#147

markshannon (Member):

Thanks. Looks good.

@markshannon added the test-with-buildbots label on Jun 20, 2022
markshannon (Member):

I'll merge once the build bots are happy.

bedevere-bot:

🤖 New build scheduled with the buildbot fleet by @markshannon for commit db21da1 🤖

If you want to schedule another build, you need to add the ":hammer: test-with-buildbots" label again.


Labels

3.12, interpreter-core, performance
