PERF: optimize block consolidation #64574
Merged
jorisvandenbossche merged 6 commits into pandas-dev:main on Apr 20, 2026
@jorisvandenbossche is the comment or anything else still unclear?
mroeschke reviewed Apr 14, 2026
- _form_blocks: use dict-based grouping instead of itertools.groupby so non-consecutive same-dtype arrays are grouped in a single pass, avoiding a redundant _consolidate_inplace pass with its vstack+argsort
- _consolidate_check: exit early on first duplicate dtype instead of collecting all dtypes into a list and set
- _merge_blocks: skip argsort when mgr_locs are already monotonic, checked via libalgos.is_monotonic which short-circuits on first violation
- _merge_blocks: use np.concatenate directly instead of np.vstack to avoid atleast_2d overhead (block values are always 2D)
- _grouping_key: use dtype.name (str) for numpy dtypes and id(dtype) for extension dtypes to avoid expensive ExtensionDtype.__hash__ (e.g. CategoricalDtype hashes all categories)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
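To make the `_form_blocks` point concrete, here is a minimal pure-Python sketch of the difference between grouping consecutive runs and grouping with a dict. The `(position, dtype name)` pairs and both function names are hypothetical stand-ins, not the pandas implementation.

```python
from itertools import groupby

# Hypothetical stand-ins for column arrays: (column position, dtype name).
columns = [(0, "float64"), (1, "int64"), (2, "float64"), (3, "int64"), (4, "float64")]


def group_consecutive(cols):
    """itertools.groupby only merges adjacent items that share a key."""
    return [
        (key, [pos for pos, _ in run])
        for key, run in groupby(cols, key=lambda c: c[1])
    ]


def group_by_dict(cols):
    """A dict collects every position per key in a single pass."""
    groups = {}
    for pos, dtype_name in cols:
        groups.setdefault(dtype_name, []).append(pos)
    return list(groups.items())


print(group_consecutive(columns))  # one group per run of equal dtypes
print(group_by_dict(columns))      # one group per distinct dtype
```

With alternating dtypes the run-based grouping yields one group per column, which is why a second consolidation pass was needed afterwards; the dict-based grouping yields one group per dtype directly.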
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
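The `_consolidate_check` change described in the commit message above can be sketched in plain Python. Both function names and the `counting` helper are hypothetical and only illustrate the early-exit idea; they are not the pandas code.

```python
def has_duplicate_eager(dtypes):
    """Collect everything into a list and a set, then compare sizes."""
    seen = list(dtypes)
    return len(seen) != len(set(seen))


def has_duplicate_early_exit(dtypes):
    """Return as soon as a dtype is seen for the second time."""
    seen = set()
    for dtype in dtypes:
        if dtype in seen:
            return True
        seen.add(dtype)
    return False


def counting(items, counter):
    """Yield items while recording how many were consumed."""
    for item in items:
        counter[0] += 1
        yield item


# 500 hypothetical block dtypes with a duplicate right at the start.
dtypes = ["float64", "float64"] + ["int64"] * 498

consumed_eager = [0]
eager_result = has_duplicate_eager(counting(dtypes, consumed_eager))

consumed_early = [0]
early_result = has_duplicate_early_exit(counting(dtypes, consumed_early))

print(eager_result, consumed_eager[0])
print(early_result, consumed_early[0])
```

Both variants agree on the answer; the early-exit version simply stops reading its input once the answer is known.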
The dict-based grouping in _form_blocks now consolidates non-consecutive same-dtype arrays upfront, so reset_index + column selection no longer produces a different block arrangement. Remove the precondition assertion that checked for this rearrangement.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
f1b80e2 to 7cfc658
mroeschke approved these changes Apr 16, 2026
Comment on lines -20 to -25
    df0 = DataFrame({"A": ["x", "y"], "B": [1, 2], "C": ["w", "z"]})
    df1 = df0.reset_index()[["A", "B", "C"]]
    if not using_infer_string:
        # this assert verifies that the above operations have
        # induced a block rearrangement
        assert df0._mgr.blocks[0].dtype != df1._mgr.blocks[0].dtype
Do you know of a different way to construct df1 and df2 to have a different block structure? Because if this assert no longer holds, there is not much point in keeping the test?
updated to ensure different block structure
…erence Use float+int dtypes with sequential __setitem__ on df1 so the float columns stay in separate blocks, giving df0 (2 blocks) and df1 (3 blocks) a genuinely different block structure. Assert the block counts as a precondition so the test retains its purpose.
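As a rough illustration of the construction this commit message describes, the sketch below builds two frames with the same float and int data but different internal layouts. The column names and values are made up, the actual test may differ, and `_mgr.blocks` is private pandas API whose behavior can change between versions.

```python
import pandas as pd

# Built in one go: same-dtype columns are expected to share a block.
df0 = pd.DataFrame({"a": [1.0, 2.0], "b": [1, 2], "c": [3.0, 4.0]})

# Built column by column: each __setitem__ of a new column is expected
# to add its own block, so the two float columns stay separate.
df1 = pd.DataFrame({"a": [1.0, 2.0]})
df1["b"] = [1, 2]
df1["c"] = [3.0, 4.0]

# Private API, used here only to inspect the layout.
n_blocks_df0 = len(df0._mgr.blocks)
n_blocks_df1 = len(df1._mgr.blocks)
print(n_blocks_df0, n_blocks_df1)

# The public contents are the same even though the layouts differ.
print(df0.equals(df1))
```

Asserting the block counts up front, as the commit does, guards against the test silently losing its purpose if construction starts consolidating differently.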
…orms Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
jorisvandenbossche approved these changes Apr 20, 2026
Thanks @jbrockmendel
Sharl0tteIsTaken added a commit to Sharl0tteIsTaken/pandas that referenced this pull request Apr 22, 2026
…h-origin

* upstream/main: (31 commits)
  DOC:Missing r in your (pandas-dev#65323)
  DOC: fix grammar in the .dt accessor section (pandas-dev#65325)
  REGR: restore rank() for ExtensionArrays with custom values for sorting (pandas-dev#64976)
  BUG: MultiIndex.get_loc returns scalar for unique key in non-unique index (pandas-dev#65234)
  BUG/TST: add test for _cast_pointwise_result robustness + fix some cases (pandas-dev#65318)
  BUG: fix .loc with tuple key on MultiIndex with IntervalIndex level (pandas-dev#65239)
  BUG: permit building from source with mingw (pandas-dev#64849)
  BUG: DataFrame.loc setitem with list-like value on single-column EA DataFrame (pandas-dev#65241)
  PERF: preserve block memory layout in Block.copy (GH#60469) (pandas-dev#65302)
  PERF: short-circuit sort_index(level=...) on monotonic non-MultiIndex (pandas-dev#65279)
  BUG: fix FloatingArray.astype(str) crash with distinguish_nan_and_na=True (pandas-dev#65038)
  BUG: fix to_timedelta ignoring unit for mixed round/non-round floats (pandas-dev#65170)
  BUG: DataFrame.loc preserves original index name when key is an Index (pandas-dev#65229)
  REF: continue moving freq management off DatetimeArray/TimedeltaArray (GH#24566) (pandas-dev#65285)
  REF: remove redundant BaseMaskedArray.map override (pandas-dev#65297)
  Bump github/codeql-action from 4.35.1 to 4.35.2 (pandas-dev#65310)
  Bump actions/setup-node from 6.3.0 to 6.4.0 (pandas-dev#65309)
  BUG: Fix formatters applied to wrong columns in truncated DataFrame.to_string (GH#35410) (pandas-dev#65288)
  PERF: optimize block consolidation (pandas-dev#64574)
  CLN: Replace no_default signature with False for allow_duplicates in insert and reset_index (pandas-dev#65146)
  ...
jbrockmendel added a commit to jbrockmendel/pandas that referenced this pull request Apr 24, 2026
The `_grouping_key` helper introduced in pandas-dev#64574 keyed the per-dtype grouping dict on `dtype.name` for numpy dtypes. Reading `np.dtype.name` walks through `_name_get` -> `_name_includes_bit_suffix` -> `issubdtype`, which costs ~225ns per call, about 22x the cost of using the dtype object itself as the key (np.dtype already has a cheap hash/eq).

Use the dtype directly as the key, and inline the (now trivial) helper to remove the per-array Python function call overhead.

On `frame_ctor.FromArrays.time_frame_from_arrays_float` (1000x1000 float arrays) this takes the benchmark from ~1.9ms back under 1ms, restoring and slightly improving on 3.0.x.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
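For readers who want to check the key-cost claim on their own machine, the following is a small measurement sketch. The ~225ns and 22x figures are the commit author's; this snippet only shows one way to compare the two key choices, and its timings will vary with hardware and NumPy version. The `arrays` list and `groups` dict are illustrative names.

```python
import timeit

import numpy as np

dtype = np.dtype("float64")

# Time producing each candidate dict key 100,000 times.
t_name = timeit.timeit(lambda: dtype.name, number=100_000)
t_hash = timeit.timeit(lambda: hash(dtype), number=100_000)
print(f"dtype.name: {t_name:.4f}s  hash(dtype): {t_hash:.4f}s")

# Grouping keyed directly on the dtype object: equal dtypes hash and
# compare equal, so they land in the same bucket.
arrays = [np.zeros(3), np.arange(3), np.ones(3)]
groups = {}
for position, arr in enumerate(arrays):
    groups.setdefault(arr.dtype, []).append(position)

print(groups)
```

Keying on the dtype object keeps the grouping semantics the same for numpy dtypes while skipping the string construction entirely.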
Summary
- `_form_blocks`: use dict-based grouping instead of `itertools.groupby` so non-consecutive same-dtype arrays are grouped in a single pass, avoiding a redundant `_consolidate_inplace` pass with its vstack+argsort
- `_consolidate_check`: exit early on first duplicate dtype instead of collecting all dtypes into a list and set
- `_merge_blocks`: skip `np.argsort` when `mgr_locs` are already monotonic, checked via `libalgos.is_monotonic` which short-circuits on first violation; use `np.concatenate` directly instead of `np.vstack` to avoid `atleast_2d` overhead
- `_grouping_key`: use `dtype.name` (str) for numpy dtypes and `id(dtype)` for extension dtypes to avoid expensive `ExtensionDtype.__hash__` (e.g. `CategoricalDtype` hashes all categories)

Benchmarks
- DataFrame(500 cols, 3 alternating dtypes)
- consolidate() 500-col unconsolidated
- DataFrame(100 cols, mixed w/ big Categoricals)
- is_consolidated() 500-block unconsolidated

Test plan
- `pandas/tests/internals/`: 258 passed
- `pandas/tests/frame/test_constructors.py`: 1017 passed
- `pandas/tests/frame/test_block_internals.py`: passed
- `pandas/tests/reshape/concat/`: passed

🤖 Generated with Claude Code
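The `_merge_blocks` item in the summary combines two ideas: concatenate 2D values directly, and only sort when the locations are out of order. The sketch below illustrates both with a hypothetical `merge_values` helper. The monotonic check here is a plain NumPy comparison used as a stand-in; unlike `libalgos.is_monotonic` it does not short-circuit.

```python
import numpy as np


def merge_values(values_list, locs_list):
    """Hypothetical helper: stack 2D block values and order rows by location."""
    # Inputs are already 2D, so concatenating along axis 0 gives the
    # same result as np.vstack without the atleast_2d step.
    values = np.concatenate(values_list, axis=0)
    locs = np.concatenate(locs_list)

    # Stand-in monotonic check (scans the whole array).
    if locs.size < 2 or bool(np.all(locs[1:] >= locs[:-1])):
        return values, locs  # already ordered, no argsort needed

    order = np.argsort(locs)
    return values[order], locs[order]


a = np.array([[1.0, 2.0]])
b = np.array([[3.0, 4.0], [5.0, 6.0]])

# Locations already increasing: the sort is skipped.
sorted_vals, sorted_locs = merge_values([a, b], [np.array([0]), np.array([1, 2])])

# Locations out of order: fall back to argsort.
fixed_vals, fixed_locs = merge_values([b, a], [np.array([1, 2]), np.array([0])])

print(sorted_locs, fixed_locs)
```

In both calls the rows end up ordered by location; the first call simply gets there without paying for the sort.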