PERF: optimize block consolidation #64574

Merged
jorisvandenbossche merged 6 commits into pandas-dev:main from jbrockmendel:perf-consolidate
Apr 20, 2026
Conversation

@jbrockmendel
Member

Summary

  • _form_blocks: use dict-based grouping instead of itertools.groupby so non-consecutive same-dtype arrays are grouped in a single pass, avoiding a redundant _consolidate_inplace pass with its vstack+argsort
  • _consolidate_check: exit early on first duplicate dtype instead of collecting all dtypes into a list and set
  • _merge_blocks: skip np.argsort when mgr_locs are already monotonic, checked via libalgos.is_monotonic which short-circuits on first violation; use np.concatenate directly instead of np.vstack to avoid atleast_2d overhead
  • _grouping_key: use dtype.name (str) for numpy dtypes and id(dtype) for extension dtypes to avoid expensive ExtensionDtype.__hash__ (e.g. CategoricalDtype hashes all categories)
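The first bullet turns on a property of itertools.groupby: it only merges adjacent equal keys, so alternating dtypes produce one group per array. A minimal sketch of the difference, using plain strings as hypothetical stand-ins for array dtypes (the helper names are illustrative and are not the pandas internals):

```python
from collections import defaultdict
from itertools import groupby

# Hypothetical stand-in: each "array" is represented only by its dtype name.
dtypes = ["float64", "int64", "float64", "int64", "object"]


def group_consecutive(keys):
    """Group positions with itertools.groupby, which only merges adjacent runs."""
    return [
        (key, [pos for pos, _ in run])
        for key, run in groupby(enumerate(keys), key=lambda pair: pair[1])
    ]


def group_all(keys):
    """Group positions by key in one pass, regardless of adjacency."""
    groups = defaultdict(list)
    for pos, key in enumerate(keys):
        groups[key].append(pos)
    return dict(groups)


consecutive = group_consecutive(dtypes)
single_pass = group_all(dtypes)

print(len(consecutive))  # one group per run of adjacent equal keys
print(single_pass)       # one group per distinct key
```

With alternating keys the groupby version yields five groups that would still need merging afterwards, while the dict version yields three groups directly.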

Benchmarks

| Benchmark | main | PR | Speedup |
| --- | --- | --- | --- |
| DataFrame(500 cols, 3 alternating dtypes) | 5.08 ms | 3.23 ms | 1.6x |
| consolidate() 500-col unconsolidated | 1.09 ms | 0.65 ms | 1.7x |
| DataFrame(100 cols, mixed w/ big Categoricals) | 0.76 ms | 0.45 ms | 1.7x |
| is_consolidated() 500-block unconsolidated | 56.1 µs | 0.6 µs | ~90x |
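
The large is_consolidated() gain is consistent with the early exit described for _consolidate_check: on an unconsolidated manager a repeated dtype tends to appear within the first few blocks. A rough sketch of the idea, assuming block dtypes can be modelled as hashable strings (both functions are hypothetical illustrations, not the pandas code):

```python
def has_duplicate_via_set(dtypes):
    """Collect everything first, then compare lengths (always visits every item)."""
    collected = list(dtypes)
    return len(collected) != len(set(collected))


def has_duplicate_early_exit(dtypes):
    """Return as soon as any dtype is seen a second time."""
    seen = set()
    for dtype in dtypes:
        if dtype in seen:
            return True
        seen.add(dtype)
    return False


def counting(items, counter):
    """Yield items while recording how many were consumed in counter[0]."""
    for item in items:
        counter[0] += 1
        yield item


# 500 hypothetical blocks with two alternating dtypes.
block_dtypes = ["float64", "int64"] * 250

consumed = [0]
found = has_duplicate_early_exit(counting(block_dtypes, consumed))
print(found, consumed[0])
```

In this setup the early-exit version stops at the third block, whereas the list-and-set version walks all 500.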

Test plan

  • pandas/tests/internals/ — 258 passed
  • pandas/tests/frame/test_constructors.py — 1017 passed
  • pandas/tests/frame/test_block_internals.py — passed
  • pandas/tests/reshape/concat/ — passed

🤖 Generated with Claude Code

Comment thread pandas/core/internals/managers.py
Comment thread pandas/core/internals/managers.py
@jbrockmendel jbrockmendel added the Performance (Memory or execution speed performance) label Mar 15, 2026
@jbrockmendel
Member Author

@jorisvandenbossche is the comment or anything else still unclear?

Comment thread pandas/core/internals/managers.py Outdated
jbrockmendel and others added 4 commits April 16, 2026 08:09
- _form_blocks: use dict-based grouping instead of itertools.groupby
  so non-consecutive same-dtype arrays are grouped in a single pass,
  avoiding a redundant _consolidate_inplace pass with its vstack+argsort
- _consolidate_check: exit early on first duplicate dtype instead of
  collecting all dtypes into a list and set
- _merge_blocks: skip argsort when mgr_locs are already monotonic,
  checked via libalgos.is_monotonic which short-circuits on first
  violation
- _merge_blocks: use np.concatenate directly instead of np.vstack to
  avoid atleast_2d overhead (block values are always 2D)
- _grouping_key: use dtype.name (str) for numpy dtypes and id(dtype)
  for extension dtypes to avoid expensive ExtensionDtype.__hash__
  (e.g. CategoricalDtype hashes all categories)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
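
The _merge_blocks item in the commit message above amounts to "check order cheaply, sort only if needed". A pure-Python sketch of that pattern, assuming block locations are a list of integers and the block rows a parallel list (the function names are hypothetical; the PR itself uses libalgos.is_monotonic and np.argsort on NumPy arrays):

```python
def is_monotonic_increasing(values):
    """Return False at the first out-of-order pair instead of scanning everything."""
    iterator = iter(values)
    previous = next(iterator, None)
    for current in iterator:
        if current < previous:
            return False
        previous = current
    return True


def order_rows(locs, rows):
    """Return (locs, rows) sorted by locs, skipping the sort when already ordered."""
    if is_monotonic_increasing(locs):
        return list(locs), list(rows)
    # Equivalent of an argsort: positions of locs in ascending order.
    order = sorted(range(len(locs)), key=locs.__getitem__)
    return [locs[i] for i in order], [rows[i] for i in order]


already_sorted = order_rows([0, 1, 2], ["a", "b", "c"])
needs_sort = order_rows([2, 0, 1], ["c", "a", "b"])
print(already_sorted)
print(needs_sort)
```

Both calls produce the same ordered result; only the second pays for the sort.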
The dict-based grouping in _form_blocks now consolidates non-consecutive
same-dtype arrays upfront, so reset_index + column selection no longer
produces a different block arrangement. Remove the precondition assertion
that checked for this rearrangement.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@mroeschke mroeschke added this to the 3.1 milestone Apr 16, 2026
Comment on lines -20 to -25
df0 = DataFrame({"A": ["x", "y"], "B": [1, 2], "C": ["w", "z"]})
df1 = df0.reset_index()[["A", "B", "C"]]
if not using_infer_string:
    # this assert verifies that the above operations have
    # induced a block rearrangement
    assert df0._mgr.blocks[0].dtype != df1._mgr.blocks[0].dtype
Member

Do you know of a different way to construct df1 and df2 to have a different block structure? Because if this assert no longer holds, there is not much point in keeping the test?

Member Author

updated to ensure different block structure

jbrockmendel and others added 2 commits April 16, 2026 10:08
…erence

Use float+int dtypes with sequential __setitem__ on df1 so the float
columns stay in separate blocks, giving df0 (2 blocks) and df1 (3 blocks)
a genuinely different block structure. Assert the block counts as a
precondition so the test retains its purpose.
…orms

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@jorisvandenbossche jorisvandenbossche merged commit e6683f7 into pandas-dev:main Apr 20, 2026
45 checks passed
@jorisvandenbossche
Member

Thanks @jbrockmendel

@jbrockmendel jbrockmendel deleted the perf-consolidate branch April 20, 2026 14:51
Sharl0tteIsTaken added a commit to Sharl0tteIsTaken/pandas that referenced this pull request Apr 22, 2026
…h-origin

* upstream/main: (31 commits)
  DOC:Missing r in your (pandas-dev#65323)
  DOC: fix grammar in the .dt accessor section (pandas-dev#65325)
  REGR: restore rank() for ExtensionArrays with custom values for sorting (pandas-dev#64976)
  BUG: MultiIndex.get_loc returns scalar for unique key in non-unique index (pandas-dev#65234)
  BUG/TST: add test for _cast_pointwise_result robustness + fix some cases (pandas-dev#65318)
  BUG: fix .loc with tuple key on MultiIndex with IntervalIndex level (pandas-dev#65239)
  BUG: permit building from source with mingw (pandas-dev#64849)
  BUG: DataFrame.loc setitem with list-like value on single-column EA DataFrame (pandas-dev#65241)
  PERF: preserve block memory layout in Block.copy (GH#60469) (pandas-dev#65302)
  PERF: short-circuit sort_index(level=...) on monotonic non-MultiIndex (pandas-dev#65279)
  BUG: fix FloatingArray.astype(str) crash with distinguish_nan_and_na=True (pandas-dev#65038)
  BUG: fix to_timedelta ignoring unit for mixed round/non-round floats (pandas-dev#65170)
  BUG: DataFrame.loc preserves original index name when key is an Index (pandas-dev#65229)
  REF: continue moving freq management off DatetimeArray/TimedeltaArray (GH#24566) (pandas-dev#65285)
  REF: remove redundant BaseMaskedArray.map override (pandas-dev#65297)
  Bump github/codeql-action from 4.35.1 to 4.35.2 (pandas-dev#65310)
  Bump actions/setup-node from 6.3.0 to 6.4.0 (pandas-dev#65309)
  BUG: Fix formatters applied to wrong columns in truncated DataFrame.to_string (GH#35410) (pandas-dev#65288)
  PERF: optimize block consolidation (pandas-dev#64574)
  CLN: Replace no_default signature with False for allow_duplicates in insert and reset_index (pandas-dev#65146)
  ...
jbrockmendel added a commit to jbrockmendel/pandas that referenced this pull request Apr 24, 2026
The `_grouping_key` helper introduced in pandas-dev#64574 keyed the per-dtype
grouping dict on `dtype.name` for numpy dtypes. Reading `np.dtype.name`
walks through `_name_get` -> `_name_includes_bit_suffix` -> `issubdtype`,
which costs ~225ns per call -- about 22x the cost of using the dtype
object itself as the key (np.dtype already has a cheap hash/eq).

Use the dtype directly as the key, and inline the (now trivial) helper
to remove the per-array Python function call overhead. On
`frame_ctor.FromArrays.time_frame_from_arrays_float` (1000x1000 float
arrays) this takes the benchmark from ~1.9ms back under 1ms, restoring
and slightly improving on 3.0.x.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
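
The follow-up commit above relies on np.dtype objects being hashable and comparable, so they can serve as dict keys directly. A small sketch of grouping by the dtype object, with an informal timing comparison against reading dtype.name (the helper name and array sizes are illustrative assumptions; timings vary by machine and NumPy version, so none are asserted here):

```python
import timeit

import numpy as np

# Three small arrays with two distinct dtypes, in non-consecutive order.
arrays = [
    np.zeros(3, dtype="float64"),
    np.arange(3, dtype="int64"),
    np.ones(3, dtype="float64"),
]


def group_by_dtype(arrs):
    """Group array positions using the np.dtype object itself as the dict key."""
    groups = {}
    for pos, arr in enumerate(arrs):
        groups.setdefault(arr.dtype, []).append(pos)
    return groups


groups = group_by_dtype(arrays)
print(groups)

# Informal comparison of the two key choices; absolute numbers are not meaningful.
dt = np.dtype("float64")
name_cost = timeit.timeit(lambda: dt.name, number=100_000)
hash_cost = timeit.timeit(lambda: hash(dt), number=100_000)
print(f"dtype.name: {name_cost:.4f}s  hash(dtype): {hash_cost:.4f}s")
```

Equal dtypes created separately compare and hash equal, so the two float64 arrays land in the same group without any string conversion.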

Labels

Performance Memory or execution speed performance
