PERF: optimize block consolidation by jbrockmendel · Pull Request #64574 · pandas-dev/pandas

jbrockmendel · 2026-03-13T01:08:51Z

Summary

- _form_blocks: use dict-based grouping instead of itertools.groupby so non-consecutive same-dtype arrays are grouped in a single pass, avoiding a redundant _consolidate_inplace pass with its vstack+argsort
- _consolidate_check: exit early on first duplicate dtype instead of collecting all dtypes into a list and set
- _merge_blocks: skip np.argsort when mgr_locs are already monotonic, checked via libalgos.is_monotonic which short-circuits on first violation; use np.concatenate directly instead of np.vstack to avoid atleast_2d overhead
- _grouping_key: use dtype.name (str) for numpy dtypes and id(dtype) for extension dtypes to avoid expensive ExtensionDtype.__hash__ (e.g. CategoricalDtype hashes all categories)
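
To illustrate the first point, here is a minimal sketch of why `itertools.groupby` leaves alternating dtypes ungrouped while a dict-based pass does not. It is not the pandas implementation: the helper names and sample arrays are hypothetical, and it keys on the dtype object itself for brevity rather than on the `_grouping_key` described above.

```python
from collections import defaultdict
from itertools import groupby

import numpy as np

# Hypothetical column arrays with alternating dtypes (float, int, float, int).
arrays = [
    np.array([1.0, 2.0]),
    np.array([1, 2], dtype=np.int64),
    np.array([3.0, 4.0]),
    np.array([3, 4], dtype=np.int64),
]


def group_consecutive(arrays):
    """Group column positions with itertools.groupby.

    groupby only merges *adjacent* items sharing a key, so alternating
    dtypes yield one group per array.
    """
    keyed = enumerate(arrays)
    return [
        (dtype, [pos for pos, _ in items])
        for dtype, items in groupby(keyed, key=lambda pair: pair[1].dtype)
    ]


def group_by_dict(arrays):
    """Group column positions by dtype in one pass, adjacent or not."""
    groups = defaultdict(list)
    for pos, arr in enumerate(arrays):
        groups[arr.dtype].append(pos)
    return dict(groups)


consecutive = group_consecutive(arrays)  # four single-column groups
by_dict = group_by_dict(arrays)          # two groups, one per dtype
```

With the groupby variant, the four single-column groups would still need a later merge step; the dict variant produces one group per dtype up front.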

Benchmarks

Benchmark	main	PR	Speedup
`DataFrame(500 cols, 3 alternating dtypes)`	5.08 ms	3.23 ms	1.6x
`consolidate()` 500-col unconsolidated	1.09 ms	0.65 ms	1.7x
`DataFrame(100 cols, mixed w/ big Categoricals)`	0.76 ms	0.45 ms	1.7x
`is_consolidated()` 500-block unconsolidated	56.1 µs	0.6 µs	~90x
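
The large ratio on the `is_consolidated()` row is consistent with the early exit described in the summary: a duplicate dtype among the blocks means the manager is not consolidated, so the check can stop at the first repeat. A rough sketch of the two strategies follows, using plain strings as stand-ins for dtypes; the function names are hypothetical and this is not the pandas code.

```python
def has_duplicate_collect_all(dtypes):
    """Collect every dtype first, then compare list and set sizes."""
    dtypes = list(dtypes)
    return len(dtypes) != len(set(dtypes))


def has_duplicate_early_exit(dtypes):
    """Return True as soon as a dtype is seen a second time."""
    seen = set()
    for dtype in dtypes:
        if dtype in seen:
            return True
        seen.add(dtype)
    return False


consumed = []


def tracked(items):
    """Yield items while recording how many were actually inspected."""
    for item in items:
        consumed.append(item)
        yield item


# 500 hypothetical block dtypes; the first repeat is the third entry.
dtypes = ["float64", "int64", "float64"] + ["object"] * 497
found = has_duplicate_early_exit(tracked(dtypes))
```

Here the early-exit version inspects only three of the 500 entries before returning, whereas the collect-all version always walks the full sequence.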

Test plan

pandas/tests/internals/ — 258 passed
pandas/tests/frame/test_constructors.py — 1017 passed
pandas/tests/frame/test_block_internals.py — passed
pandas/tests/reshape/concat/ — passed

🤖 Generated with Claude Code

jbrockmendel · 2026-04-06T16:58:33Z

@jorisvandenbossche is the comment or anything else still unclear?

- _form_blocks: use dict-based grouping instead of itertools.groupby so non-consecutive same-dtype arrays are grouped in a single pass, avoiding a redundant _consolidate_inplace pass with its vstack+argsort
- _consolidate_check: exit early on first duplicate dtype instead of collecting all dtypes into a list and set
- _merge_blocks: skip argsort when mgr_locs are already monotonic, checked via libalgos.is_monotonic which short-circuits on first violation
- _merge_blocks: use np.concatenate directly instead of np.vstack to avoid atleast_2d overhead (block values are always 2D)
- _grouping_key: use dtype.name (str) for numpy dtypes and id(dtype) for extension dtypes to avoid expensive ExtensionDtype.__hash__ (e.g. CategoricalDtype hashes all categories)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
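
The two `_merge_blocks` points can be sketched as follows. This is an illustration under stated assumptions, not the pandas code: `is_monotonic_increasing` is a pure-Python stand-in for the short-circuiting `libalgos.is_monotonic` check, and `merge_2d` is a hypothetical helper operating on plain 2D arrays plus integer location arrays.

```python
import numpy as np


def is_monotonic_increasing(locs):
    """Stand-in monotonicity check that stops at the first violation."""
    for prev, cur in zip(locs[:-1], locs[1:]):
        if cur < prev:
            return False
    return True


def merge_2d(values_list, locs_list):
    """Stack 2D blocks row-wise and order the rows by their locations."""
    # Inputs are already 2D, so np.concatenate along axis 0 gives the same
    # result as np.vstack without the atleast_2d step.
    new_values = np.concatenate(values_list, axis=0)
    new_locs = np.concatenate(locs_list)
    # Only pay for argsort and the reindexing copy when rows are out of order.
    if not is_monotonic_increasing(new_locs):
        order = np.argsort(new_locs)
        new_values = new_values[order]
        new_locs = new_locs[order]
    return new_values, new_locs


a = np.array([[1.0, 2.0]])
b = np.array([[3.0, 4.0], [5.0, 6.0]])

# Already ordered: the argsort branch is skipped.
vals, locs = merge_2d([a, b], [np.array([0]), np.array([2, 3])])

# Out of order: rows are sorted by location.
vals2, locs2 = merge_2d([b, a], [np.array([2, 3]), np.array([0])])
```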

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

The dict-based grouping in _form_blocks now consolidates non-consecutive same-dtype arrays upfront, so reset_index + column selection no longer produces a different block arrangement. Remove the precondition assertion that checked for this rearrangement. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

jorisvandenbossche · 2026-04-16T16:23:37Z

        df0 = DataFrame({"A": ["x", "y"], "B": [1, 2], "C": ["w", "z"]})
        df1 = df0.reset_index()[["A", "B", "C"]]
-        if not using_infer_string:
-            # this assert verifies that the above operations have
-            # induced a block rearrangement
-            assert df0._mgr.blocks[0].dtype != df1._mgr.blocks[0].dtype


Do you know of a different way to construct df1 and df2 to have a different block structure? Because if this assert no longer holds, there is not much point in keeping the test?

jbrockmendel · 2026-04-17

updated to ensure different block structure

…erence Use float+int dtypes with sequential __setitem__ on df1 so the float columns stay in separate blocks, giving df0 (2 blocks) and df1 (3 blocks) a genuinely different block structure. Assert the block counts as a precondition so the test retains its purpose.
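
The updated test itself is not shown on this page. The snippet below is only a sketch of the kind of construction the commit message describes, with hypothetical column names and values; it relies on the private `_mgr.blocks` attribute in the same way the removed assertion did.

```python
import numpy as np
import pandas as pd

# Built in one go: the constructor groups the two float columns together,
# giving one float block and one int block.
df0 = pd.DataFrame(
    {
        "A": [1.0, 2.0],
        "B": np.array([1, 2], dtype="int64"),  # pinned so 32-bit platforms agree
        "C": [3.0, 4.0],
    }
)

# Built column by column: each __setitem__ on a new label appends a block,
# so the two float columns stay in separate blocks.
df1 = pd.DataFrame({"A": [1.0, 2.0]})
df1["B"] = np.array([1, 2], dtype="int64")
df1["C"] = [3.0, 4.0]

n_blocks_df0 = len(df0._mgr.blocks)
n_blocks_df1 = len(df1._mgr.blocks)
same_content = df0.equals(df1)
```

The point of such a precondition is that `equals` is exercised on frames whose contents match but whose internal block layout does not.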

…orms Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

jorisvandenbossche · 2026-04-20T13:59:11Z

Thanks @jbrockmendel

…h-origin

* upstream/main: (31 commits)
  DOC:Missing r in your (pandas-dev#65323)
  DOC: fix grammar in the .dt accessor section (pandas-dev#65325)
  REGR: restore rank() for ExtensionArrays with custom values for sorting (pandas-dev#64976)
  BUG: MultiIndex.get_loc returns scalar for unique key in non-unique index (pandas-dev#65234)
  BUG/TST: add test for _cast_pointwise_result robustness + fix some cases (pandas-dev#65318)
  BUG: fix .loc with tuple key on MultiIndex with IntervalIndex level (pandas-dev#65239)
  BUG: permit building from source with mingw (pandas-dev#64849)
  BUG: DataFrame.loc setitem with list-like value on single-column EA DataFrame (pandas-dev#65241)
  PERF: preserve block memory layout in Block.copy (GH#60469) (pandas-dev#65302)
  PERF: short-circuit sort_index(level=...) on monotonic non-MultiIndex (pandas-dev#65279)
  BUG: fix FloatingArray.astype(str) crash with distinguish_nan_and_na=True (pandas-dev#65038)
  BUG: fix to_timedelta ignoring unit for mixed round/non-round floats (pandas-dev#65170)
  BUG: DataFrame.loc preserves original index name when key is an Index (pandas-dev#65229)
  REF: continue moving freq management off DatetimeArray/TimedeltaArray (GH#24566) (pandas-dev#65285)
  REF: remove redundant BaseMaskedArray.map override (pandas-dev#65297)
  Bump github/codeql-action from 4.35.1 to 4.35.2 (pandas-dev#65310)
  Bump actions/setup-node from 6.3.0 to 6.4.0 (pandas-dev#65309)
  BUG: Fix formatters applied to wrong columns in truncated DataFrame.to_string (GH#35410) (pandas-dev#65288)
  PERF: optimize block consolidation (pandas-dev#64574)
  CLN: Replace no_default signature with False for allow_duplicates in insert and reset_index (pandas-dev#65146)
  ...

The `_grouping_key` helper introduced in pandas-dev#64574 keyed the per-dtype grouping dict on `dtype.name` for numpy dtypes. Reading `np.dtype.name` walks through `_name_get` -> `_name_includes_bit_suffix` -> `issubdtype`, which costs ~225ns per call -- about 22x the cost of using the dtype object itself as the key (np.dtype already has a cheap hash/eq).

Use the dtype directly as the key, and inline the (now trivial) helper to remove the per-array Python function call overhead.

On `frame_ctor.FromArrays.time_frame_from_arrays_float` (1000x1000 float arrays) this takes the benchmark from ~1.9ms back under 1ms, restoring and slightly improving on 3.0.x.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
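
A small sketch of the idea in this commit message: equal numpy dtypes compare and hash alike, so the dtype object can serve directly as a dict key. The timing lines are only a way to compare the two key choices locally; the absolute numbers depend on the machine and NumPy version, so no expected values are given.

```python
import timeit

import numpy as np

dtype = np.dtype("float64")

# Two separately constructed but equal dtypes resolve to the same dict entry.
blocks_by_dtype = {np.dtype("float64"): "float block"}
same_key = blocks_by_dtype[np.dtype(np.float64)]

# Rough per-call comparison of deriving a string key versus hashing the dtype.
n_calls = 100_000
name_time = timeit.timeit(lambda: dtype.name, number=n_calls)
hash_time = timeit.timeit(lambda: hash(dtype), number=n_calls)
print(f"dtype.name: {name_time:.4f}s  hash(dtype): {hash_time:.4f}s")
```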

jorisvandenbossche reviewed Mar 13, 2026

Comment thread pandas/core/internals/managers.py

jorisvandenbossche reviewed Mar 13, 2026

Comment thread pandas/core/internals/managers.py

mroeschke requested a review from jorisvandenbossche March 14, 2026 16:59

jbrockmendel added the Performance (Memory or execution speed performance) label Mar 15, 2026

jbrockmendel mentioned this pull request Apr 2, 2026

PERF: Optimize BlockManager metadata operations and dtype inference #65035

Closed

2 tasks

mroeschke reviewed Apr 14, 2026

Comment thread pandas/core/internals/managers.py Outdated

jbrockmendel and others added 4 commits April 16, 2026 08:09

CLN: remove stale type: ignore comment

ff75f40

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

more concise comment

7cfc658

jbrockmendel force-pushed the perf-consolidate branch from f1b80e2 to 7cfc658 Compare April 16, 2026 15:09

mroeschke added this to the 3.1 milestone Apr 16, 2026

mroeschke approved these changes Apr 16, 2026

jorisvandenbossche reviewed Apr 16, 2026

jbrockmendel and others added 2 commits April 16, 2026 10:08

TST: pin int64 dtype in test_equals_different_blocks for 32-bit platf…

c7a6b72

…orms Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

mroeschke requested a review from jorisvandenbossche April 17, 2026 16:03

jorisvandenbossche approved these changes Apr 20, 2026

jorisvandenbossche merged commit e6683f7 into pandas-dev:main Apr 20, 2026
45 checks passed

jbrockmendel deleted the perf-consolidate branch April 20, 2026 14:51

jbrockmendel mentioned this pull request Apr 22, 2026

PERF: avoid dtype.name in DataFrame block grouping #65337

Merged

4 tasks

jorisvandenbossche merged 6 commits into pandas-dev:main from jbrockmendel:perf-consolidate
