Skip to content

perf(prepare-list): batch list-item insertions into one pass#51

Merged
ChristianMurphy merged 3 commits into
syntax-tree:mainfrom
ChristianMurphy:perf/prepare-list-no-splice
Jun 2, 2026
Merged

perf(prepare-list): batch list-item insertions into one pass#51
ChristianMurphy merged 3 commits into
syntax-tree:mainfrom
ChristianMurphy:perf/prepare-list-no-splice

Conversation

@ChristianMurphy

@ChristianMurphy ChristianMurphy commented May 3, 2026

Copy link
Copy Markdown
Member

Initial checklist

  • I read the support docs
  • I read the contributing guide
  • I agree to follow the code of conduct
  • I searched issues and discussions and couldn’t find anything or linked relevant results below
  • I made sure the docs are up to date
  • I included tests (or that’s not needed)

Description of changes

prepareList synthesizes listItem enter and exit events one item at a
time using events.splice(at, 0, [event]). Each splice shifts the
suffix of the events array, so a list with K items inside an array of
N events does O(K * N) shift work. This is the dominant cost in
mdast-util-from-markdown's contribution to wide-list inputs and the
slowdown reported at depth on issue #49 / PR #50.

The fix collects the would-be splices into an insertions queue during
the existing walk and applies them outside the loop in one pass. Two
paths handle the work efficiently: lists with up to a small number of
insertions take a fast path that splices each insertion in reverse
order so unspliced positions stay valid, which avoids the cost of
allocating a fresh sub-array; longer lists go through a batched
rebuild that writes the replacement into a fresh array, then either
splices the whole replacement in one call (when it fits below V8's
spread argument limit) or shifts the suffix once and writes the
replacement into the vacated range. The in-place shift avoids the
per-chunk splice loop a chunked spread fallback would use, which
would re-introduce O(K * N) shift cost on very wide lists.

How the cut points were chosen:

There are two thresholds in the new code: SMALL_LIST_LIMIT chooses
between fast-path splice loop and rebuild; SAFE_SPREAD chooses
between a single spread and an in-place shift.

SMALL_LIST_LIMIT is a workload-dependent crossover. The fast path
costs O(K * suffix) because each of the K splices shifts the events
suffix; the rebuild path costs O(N + K) plus a fixed allocation
overhead. Below some K the rebuild's constant overhead dominates;
above some K the fast path's K * suffix dominates. Because suffix
size and per-insertion splice cost both vary with document shape,
no single value is universally best: deeper nesting prefers higher
limits (everything stays on the splice loop), and documents with a
few moderately-sized lists prefer lower limits (the rebuild's lower
per-item cost wins). The threshold was chosen by sweeping
{0, 2, 4, 8, 16, 32, 64} with representative inputs from both
regimes and picking the value that kept the validated multi-run wins
on typical-document inputs without regressing the deep-nest case
beyond its own noise band.

SAFE_SPREAD is set by V8's argument count limit. Spreading a very
large array into events.splice can throw a stack overflow in some
V8 versions, so above the threshold the rebuild instead resizes
events once, shifts the suffix to its target position, and writes
the replacement into the vacated range. Total work is O(suffix +
replacement.length), independent of how many insertions were
queued. The single-spread threshold was tested at
{1000, 2000, 5000, 10000, 20000, 70000}; 10000 had the lowest median
across the wide-list and spec-derived inputs and matches the
threshold micromark-util-chunked already uses for its own splice
helper.

Pass-1 records insertions in non-decreasing at order by
construction (each boundary records its exit insertion at
lineIndex || index followed by its enter insertion at index,
and the next boundary's tail walk is bounded by the previous
boundary's listItemPrefix), so no sort is needed in the slow path.
A dev-only assertion verifies the invariant on every slow-path
call.

Inputs that benefit, with multi-run median-of-medians vs the baseline
(spread in parentheses):

10,000 single-level list items -40.8% (5.6%)
5,000 ATX headings -20.6% (4.3%)
CommonMark spec * 35 (~564 KB) -11.7% (1.6%, very clean)
one CommonMark example -9.1% (105%, NOISY)
256 nested ordered list levels -8.8% (20.3%)
full CommonMark spec (~16 KB) -7.0% (35.4%, NOISY)
CommonMark spec * 7 (~113 KB) -3.6% (1.0%, very clean)

Single-run full corpus runs show the same direction on every other
input that contains at least one list, with wins of -17% to -27% on
inputs heavy in fenced code blocks, images, character references,
inline links, tabs, and HTML blocks. The largest improvement is the
10,000-item single-level list input, which is the worst case for the
old per-item splice loop.

Trade-offs and inputs that do not move:

Inputs that contain no lists are unaffected by the change because
prepareList is never invoked. The pure emphasis stress inputs ('a**b'
repeated 10,000 times and similar) reported large deltas in single
runs, but those inputs have a cross-run spread of 44 to 52% on the
baseline alone, so the apparent regressions sit inside their own
noise band. A 1 MB single paragraph, a Unicode-heavy 256 KB input,
and 10,000 unmatched asterisks all moved within +/- 3% of baseline.

Tests pass: dev + prod 1452/1452, 100% coverage maintained.
mdast-util-gfm 54/54, mdast-util-mdx 11/13. The two failing mdx tests
reproduce on upstream/main and are not introduced by this branch.

Closes #49
Refs #50

@github-actions github-actions Bot added 👋 phase/new Post is being triaged automatically 🤞 phase/open Post is being triaged manually and removed 👋 phase/new Post is being triaged automatically labels May 3, 2026
@ChristianMurphy ChristianMurphy added the 🏁 area/perf This affects performance label May 3, 2026
@ChristianMurphy ChristianMurphy force-pushed the perf/prepare-list-no-splice branch from fe0eff3 to c4361a5 Compare May 3, 2026 13:52
@ChristianMurphy ChristianMurphy requested a review from Copilot May 3, 2026 13:57
@ChristianMurphy ChristianMurphy force-pushed the perf/prepare-list-no-splice branch from c4361a5 to 06efef9 Compare May 3, 2026 14:04

Copilot AI left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR refactors prepareList in the markdown compiler to avoid repeated in-loop events.splice() calls when synthesizing listItem enter/exit events. In the overall codebase, this targets a core parser hot path so wide or deeply nested list inputs can be processed more efficiently.

Changes:

  • Queue synthetic listItem enter/exit insertions during the existing list walk instead of mutating events immediately.
  • Add two post-processing insertion strategies: a small-list reverse-splice fast path and a wide-list batched rebuild path.
  • Add chunked splice handling for very large replacement ranges to stay under V8 spread/argument limits.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread dev/lib/index.js Outdated
Comment thread dev/lib/index.js Outdated
Comment thread dev/lib/index.js Outdated
Comment thread dev/lib/index.js Outdated
@ChristianMurphy ChristianMurphy force-pushed the perf/prepare-list-no-splice branch 3 times, most recently from 6d428ec to 0420c86 Compare May 3, 2026 14:45
@ChristianMurphy ChristianMurphy requested a review from Copilot May 3, 2026 14:58

Copilot AI left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread dev/lib/index.js Outdated
Comment thread test/index.js Outdated

Copilot AI left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated 4 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread dev/lib/index.js
Comment thread dev/lib/index.js Outdated
Comment thread test/index.js Outdated
Comment thread test/index.js Outdated
prepareList synthesizes listItem enter and exit events one item at a
time using events.splice(at, 0, [event]). Each splice shifts the
suffix of the events array, so a list with K items inside an array of
N events does O(K * N) shift work. This is the dominant cost in
mdast-util-from-markdown's contribution to wide-list inputs and the
slowdown reported at depth on issue syntax-tree#49 / PR syntax-tree#50.

The fix collects the would-be splices into an insertions queue during
the existing walk and applies them outside the loop in one pass. Two
paths handle the work efficiently: lists with up to a small number of
insertions take a fast path that splices each insertion in reverse
order so unspliced positions stay valid, which avoids the cost of
allocating a fresh sub-array; longer lists go through a batched
rebuild that writes the replacement into a fresh array, then either
splices the whole replacement in one call (when it fits below V8's
spread argument limit) or shifts the suffix once and writes the
replacement into the vacated range. The in-place shift avoids the
per-chunk splice loop a chunked spread fallback would use, which
would re-introduce O(K * N) shift cost on very wide lists.

How the cut points were chosen:

There are two thresholds in the new code: SMALL_LIST_LIMIT chooses
between fast-path splice loop and rebuild; SAFE_SPREAD chooses
between a single spread and an in-place shift.

SMALL_LIST_LIMIT is a workload-dependent crossover. The fast path
costs O(K * suffix) because each of the K splices shifts the events
suffix; the rebuild path costs O(N + K) plus a fixed allocation
overhead. Below some K the rebuild's constant overhead dominates;
above some K the fast path's K * suffix dominates. Because suffix
size and per-insertion splice cost both vary with document shape,
no single value is universally best: deeper nesting prefers higher
limits (everything stays on the splice loop), and documents with a
few moderately-sized lists prefer lower limits (the rebuild's lower
per-item cost wins). The threshold was chosen by sweeping
{0, 2, 4, 8, 16, 32, 64} with representative inputs from both
regimes and picking the value that kept the validated multi-run wins
on typical-document inputs without regressing the deep-nest case
beyond its own noise band.

SAFE_SPREAD is set by V8's argument count limit. Spreading a very
large array into events.splice can throw a stack overflow in some
V8 versions, so above the threshold the rebuild instead resizes
events once, shifts the suffix to its target position, and writes
the replacement into the vacated range. Total work is O(suffix +
replacement.length), independent of how many insertions were
queued. The single-spread threshold was tested at
{1000, 2000, 5000, 10000, 20000, 70000}; 10000 had the lowest median
across the wide-list and spec-derived inputs and matches the
threshold micromark-util-chunked already uses for its own splice
helper.

Pass-1 records insertions in non-decreasing `at` order by
construction (each boundary records its exit insertion at
`lineIndex || index` followed by its enter insertion at `index`,
and the next boundary's tail walk is bounded by the previous
boundary's listItemPrefix), so no sort is needed in the slow path.
A dev-only assertion verifies the invariant on every slow-path
call.

Inputs that benefit, with multi-run median-of-medians vs the baseline
(spread in parentheses):

  10,000 single-level list items     -40.8%  (5.6%)
  5,000 ATX headings                 -20.6%  (4.3%)
  CommonMark spec * 35  (~564 KB)    -11.7%  (1.6%, very clean)
  one CommonMark example              -9.1%  (105%, NOISY)
  256 nested ordered list levels      -8.8%  (20.3%)
  full CommonMark spec  (~16 KB)      -7.0%  (35.4%, NOISY)
  CommonMark spec * 7  (~113 KB)      -3.6%  (1.0%, very clean)

Single-run full corpus runs show the same direction on every other
input that contains at least one list, with wins of -17% to -27% on
inputs heavy in fenced code blocks, images, character references,
inline links, tabs, and HTML blocks. The largest improvement is the
10,000-item single-level list input, which is the worst case for the
old per-item splice loop.

Trade-offs and inputs that do not move:

Inputs that contain no lists are unaffected by the change because
prepareList is never invoked. The pure emphasis stress inputs ('a**b'
repeated 10,000 times and similar) reported large deltas in single
runs, but those inputs have a cross-run spread of 44 to 52% on the
baseline alone, so the apparent regressions sit inside their own
noise band. A 1 MB single paragraph, a Unicode-heavy 256 KB input,
and 10,000 unmatched asterisks all moved within +/- 3% of baseline.

Tests pass: dev + prod 1454/1454, 100% coverage maintained.
Three new tests cover the rebuild path: a wide-list parse-and-line
spot-check, plus first-item deepEqual against a 4-item fast-path
reference for both tight and loose lists (so a bug confined to the
rebuild branch diverges from the fast path the reference uses).
mdast-util-gfm 54/54, mdast-util-mdx 11/13. The two failing mdx
tests reproduce on upstream/main and are not introduced by this
branch.

Closes syntax-tree#49
Refs syntax-tree#50

Copilot AI left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread dev/lib/index.js Outdated
Comment thread test/index.js
Array.from({length: N}) initializes each of N slots to undefined
before the fill loop runs, which is O(N) redundant work. The while
loop immediately following writes every slot anyway, so the
pre-initialization achieves nothing.

new Array(N) allocates the backing store with capacity N in O(1)
and leaves the slots uninitialized (a V8 HOLEY array). Once the
fill loop completes, the result is identical to what Array.from
would have produced.

A micro-benchmark at N=10000 shows new Array(N) is 414x faster
for the allocation step alone: 514 ns vs 213 µs per call. The
difference is largest on the worst-case wide-list input where N is
proportional to the number of list items.

The unicorn/no-new-array lint rule is suppressed at this one site.
The rule guards against calling new Array with a single non-integer
value; here the argument is a computed sum of two non-negative
integer lengths (rangeLength and insertions.length), so the risk
the rule guards against cannot apply.

Composite multi-run bench (both commits together, 3 runs, median of
medians), run on a loaded machine (load avg 4.3, cross-run spreads
74-164% on most scenarios — numbers are directionally consistent
with the previous clean runs but cannot be read as precise):

  10,000 single-level list items     -36.8%  (74% cross-run spread)
  5,000 ATX headings                 -17.2%   (1% cross-run spread)
  CommonMark spec * 35  (~564 KB)    -14.8%   (1% cross-run spread)
  full CommonMark spec  (~16 KB)     -17.2%   (2% cross-run spread)
  CommonMark spec * 7  (~113 KB)      -5.4%   (2% cross-run spread)

The wide-list result matches the clean -40.8% from the preceding
commit's bench runs; the improvement comes overwhelmingly from the
batch-rewrite in that commit, with the Array pre-init removal as an
incremental addend that the noisy machine cannot resolve separately.
@ChristianMurphy

Copy link
Copy Markdown
Member Author

There was an also an issue with c8 and yargs that popped up since this pull request was opened. That's included in here, though happy to put it into a separate pull request.

Copilot AI left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 4 out of 4 changed files in this pull request and generated 1 comment.

Comment thread test/index.js
@ChristianMurphy

Copy link
Copy Markdown
Member Author

Base off another discussion also merged in testing on Node 24 and 26 only.
Which would make this part of the next major

@Murderlon Murderlon left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would be nice to get this released 👍

Although the CI and tests changes are not breaking changes for consumers, I kind of understand wanting to do a major as you no longer can promise it works on older versions. It would be much better though if those unrelated changes are a separate PR so this can be released sooner. I generally don't like bug/perf fixes in major versions, major versions are ideally boring.

Problem: c8 pulls yargs 17.7.2, whose extensionless ./yargs file crashes under the Node 22.20+/24/26 module-loader changes (require is not defined in ES module scope), reddening the `node` CI leg.

Goal: keep the existing c8 coverage setup working on current Node without pulling in the native-coverage tooling change.

Changes:
- add npm overrides forcing yargs to ^18.0.0 (only c8 consumes it)
@ChristianMurphy

Copy link
Copy Markdown
Member Author

Although the CI and tests changes are not breaking changes for consumers, I kind of understand wanting to do a major as you no longer can promise it works on older versions. It would be much better though if those unrelated changes are a separate PR so this can be released sooner. I generally don't like bug/perf fixes in major versions, major versions are ideally boring.

Fair, I broke this up to unstick the performance fix.
This branch now does a yargs override so the build still works on Node 26 (released while this PR was being worked on, and now breaking CI on main and this branch).
And I split out #54 to focus on native test coverage support in newer Node versions. (Something I'd ideally like to roll out to all of unified in the next major)

@ChristianMurphy ChristianMurphy merged commit 53875a7 into syntax-tree:main Jun 2, 2026
4 checks passed
@ChristianMurphy ChristianMurphy deleted the perf/prepare-list-no-splice branch June 2, 2026 15:47
@github-actions

This comment has been minimized.

@ChristianMurphy ChristianMurphy added 💪 phase/solved Post is done and removed 🤞 phase/open Post is being triaged manually labels Jun 2, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

🏁 area/perf This affects performance 💪 phase/solved Post is done

Development

Successfully merging this pull request may close these issues.

prepareList has O(n^2) complexity from per-item events.splice

4 participants