Skip to content

feat(deinterleave): real SIMD f32 RGB/RGBA chunks (NEON ld3q + AVX2 / wasm128 shuffles)#5

Closed
lilith wants to merge 2 commits into
mainfrom
deinterleave-f32-chunks
Closed

feat(deinterleave): real SIMD f32 RGB/RGBA chunks (NEON ld3q + AVX2 / wasm128 shuffles)#5
lilith wants to merge 2 commits into
mainfrom
deinterleave-f32-chunks

Conversation

@lilith
Copy link
Copy Markdown
Member

@lilith lilith commented May 2, 2026

Summary

This PR (re-titled and expanded) lands two layers of work on src/deinterleave.rs:

  1. (Original PR feat(deinterleave): real SIMD f32 RGB/RGBA chunks (NEON ld3q + AVX2 / wasm128 shuffles) #5) 12 new public scalar chunk-level f32 RGB/RGBA deinterleave + interleave functions at chunk widths {4, 8, 16}.
  2. (New) Per-arch SIMD specializations of those chunks — real vld3q_f32 / vld4q_f32 / vst3q_f32 / vst4q_f32 on AArch64; real vshufps / vinsertps / vblendps / vunpcklps / vmovlhps on x86_64 AVX2; real i8x16.shuffle / v128.load / v128.store on wasm32 SIMD128. The slice-level rgb_f32_to_planes_f32 / rgba_f32_to_planes_f32 / planes_f32_to_rgb_f32 / planes_f32_to_rgba_f32 #[arcane] dispatchers (which previously called the scalar loop body verbatim under a target_feature = "avx2" / "neon" region) now route through these chunk SIMD primitives via a 16 → 8 → 4 → scalar tail pipeline.

Public API

Twelve new shape-keys per chunk variant — chunk-{4, 8, 16} × {RGB, RGBA} × {deinterleave, interleave}:

{rgb,rgba}_f32_chunk{4,8,16}_to_planes_{scalar,v3,neon,wasm128}   (24 deinterleavers)
planes_to_{rgb,rgba}_f32_chunk{4,8,16}_{scalar,v3,neon,wasm128}   (24 interleavers)

The bare-name aliases from the original PR #5 (rgb_f32_chunk4_to_planes etc.) are preserved as thin wrappers around *_scalar so existing callers compile untouched.

cargo semver-checks check-release --all-features reports Summary no semver update required — purely additive.

Codegen verification

#[rite] on every per-arch chunk function so they fuse into the caller's #[arcane] / target_feature region — no call / b instruction at the chunk boundary. Verified by tests/asm_inline_check.rs which dumps a sample_caller_*_chunk16 body and asserts the chunk function's SIMD ops appear inline.

x86_64 AVX2 — `sample_caller_v3_rgb_chunk16` (chunk-16 RGB deinterleave, fully inlined)
asm_inline_check::x86_inline::__arcane_sample_caller_v3_rgb_chunk16:
    vmovups xmm2, xmmword ptr [rsi]
    vmovups xmm3, xmmword ptr [rsi + 16]
    vmovups xmm4, xmmword ptr [rsi + 48]
    vmovups xmm5, xmmword ptr [rsi + 64]
    vinsertps xmm0, xmm3, xmm2, 140
    vshufps xmm0, xmm0, xmmword ptr [rsi + 32], 196
    vblendps xmm1, xmm3, xmm2, 2
    vshufps xmm1, xmm1, xmm1, 241
    vbroadcastss xmm6, dword ptr [rsi + 40]
    vblendps xmm1, xmm1, xmm6, 8
    vshufps xmm3, xmm2, xmm3, 236
    ...  (4× chunk-4 sub-bodies, 56 SIMD ops total)
    vmovups xmmword ptr [rdi], xmm6
    vmovups xmmword ptr [rdi + 16], xmm4
    ...  (12× 16-byte stores)
    ret
  • 56 SIMD ops in the body (vmovups / vshufps / vblendps / vinsertps / vbroadcastss).
  • 0 call instructions to any *_chunk*_v3 symbol — fully inlined.
  • 0 scalar vmovss / movss loads.
aarch64 NEON — `garb::deinterleave::rgb_f32_to_planes_f32` (slice-level dispatch into chunk-16 NEON)
.LBB8_8:
    cmp x8, x1
    b.hi .LBB8_30
    mov x11, x9
    sub x16, x9, #96
    add x8, x8, #48
    ld3 { v0.4s, v1.4s, v2.4s }, [x11], #48
    ld3 { v3.4s, v4.4s, v5.4s }, [x16]
    ld3 { v19.4s, v20.4s, v21.4s }, [x11]
    ld3 { v16.4s, v17.4s, v18.4s }, [x16]
    ...
    stp q3, q16, [x13, #-32]
    stp q0, q19, [x13], #64
    ...
    b.ls .LBB8_8
  • ld3 { v0.4s, v1.4s, v2.4s } per inner-loop iteration (= unrolled 2 × chunk-16) — exactly the hardware structure-load.
  • 0 scalar ldr s0 loads in the inner loop.
  • The RGBA path lights up ld4 { v0.4s, v1.4s, v2.4s, v3.4s } instead — 8 in the chunk-16 unrolled body.
  • The interleave path emits st3 / st4 symmetrically (8 each in the unrolled chunk-16 inner body).
wasm32 SIMD128 — `garb::deinterleave::rgb_f32_to_planes_f32` (slice-level dispatch into chunk-16 wasm128)

WAT histogram across the full slice-level body (built with RUSTFLAGS="-C target-feature=+simd128"):

   42  i8x16.shuffle    ← cross-lane permute (i32x4_shuffle macro lowers to i8x16.shuffle)
   33  v128.load
   24  v128.store
    3  f32.load          ← scalar tail (< 4 pixels)
    3  f32.store         ← scalar tail

Test plan

  • cargo test --release --features experimental (x86_64) — 228 lib tests pass, including 12 byte-exact *_v3 vs *_scalar parity tests for the new SIMD chunks.
  • cross test --release --features experimental --target aarch64-unknown-linux-gnu (qemu) — 209 lib tests pass, including 12 byte-exact *_neon vs *_scalar parity tests.
  • RUSTFLAGS="-C target-feature=+simd128" cargo test --target wasm32-wasip1 --features experimental --release (wasmtime) — 209 lib tests pass, including 12 byte-exact *_wasm128 vs *_scalar parity tests.
  • tests/asm_inline_check.rs — runs as a smoke test on each arch when the host has the required ISA; verifies the #[rite] chunk functions inline cleanly into a sample #[arcane] caller. Passes on x86_64 native + aarch64 cross.
  • cargo semver-checks check-release --all-featuresSummary no semver update required (purely additive).
  • cargo clippy --release --all-features --lib --tests — clean.

Existing tests retained (round-trip / slice-API-equivalence / order-preservation) — they now exercise the SIMD path on each arch instead of bare scalar.

Notes

  • No version bump in Cargo.toml; release cadence is the maintainer's call.
  • The wasm32 slice-level *_impl_wasm128 route is wired up via incant!(...[v3, neon, wasm128, scalar]) for symmetry; existing slice-level callers benefit on wasm32 with +simd128.
  • The chunk-level _scalar variants stay public so callers in non-SIMD regions (or with their own dispatch) can use them without going through the slice-level dispatcher.
  • Integrating into zenpixels-convert's narrow body (the original use case) is a separate follow-up.

…leave

Mirrors the existing rgb24_chunk8_to_planes_scalar pattern at f32 input
and across {4, 8, 16}-pixel chunk widths × {RGB, RGBA} × {deinterleave,
interleave} = 12 new public functions.

Each body is a single fixed-array literal expression; LLVM auto-vectorizes
the contiguous loads into vinsertps/vshufps (x86) or tbl (AArch64)
shuffles per the "Fixed-array scalar loads/stores can auto-vectorize into
shuffles" pattern. No SIMD intrinsics, no unsafe code.

Use case: enables fused per-chunk TRC + 3×3 matrix kernels in
zenpixels-convert without manual deinterleave/interleave loops.
@lilith lilith self-assigned this May 2, 2026
lilith added a commit that referenced this pull request May 3, 2026
…shuffles

The 8 #[arcane] f32 RGB/RGBA slice-level dispatchers in PR #5 were
scalar stubs — the v3 / neon variants called the same loop body as
scalar, with only autovectorize as the SIMD path. Replace with real
hand-rolled per-arch chunk SIMD:

- aarch64 NEON: vld3q_f32 / vld4q_f32 / vst3q_f32 / vst4q_f32 hardware
  structure-load instructions, one per 4-pixel chunk. Confirmed
  ld3 { v0.4s, v1.4s, v2.4s } / st3 / st4 in cargo asm output for the
  slice-level rgb_f32_to_planes_f32 and friends.

- x86_64 AVX2: 128-bit shuffles using vshufps / vinsertps / vblendps /
  vbroadcastss / vpermilps for chunk-4 RGB; _MM_TRANSPOSE4_PS via
  vunpcklps / vunpckhps / vmovlhps / vmovhlps for chunk-4 RGBA. Larger
  chunk sizes (8, 16) compose chunk-4 calls for clean AVX2 codegen.
  cargo asm shows 56+ vshufps / vblendps / vinsertps / vbroadcastss
  in the inlined chunk-16 path with zero scalar f32 ops.

- wasm32 SIMD128: i8x16.shuffle (the wasm i32x4_shuffle macro lowers
  to i8x16.shuffle) for cross-lane 5-shuffle deinterleave. cargo asm
  shows 42 i8x16.shuffle / 33 v128.load / 24 v128.store in the
  slice-level body.

Each chunk-level SIMD function is #[rite] so it inlines into the
caller's #[arcane] / target_feature region with zero call/return
overhead — verified by tests/asm_inline_check.rs which asserts the
sample_caller_*_chunk16 body contains the chunk function's SIMD ops
inline (no call instruction).

Public API additions (purely additive — cargo semver-checks reports
no semver update required):

  rgb_f32_chunk{4,8,16}_to_planes_{scalar,v3,neon,wasm128}
  rgba_f32_chunk{4,8,16}_to_planes_{scalar,v3,neon,wasm128}
  planes_to_rgb_f32_chunk{4,8,16}_{scalar,v3,neon,wasm128}
  planes_to_rgba_f32_chunk{4,8,16}_{scalar,v3,neon,wasm128}

PR #5's no-suffix names are kept as thin wrappers around *_scalar so
existing callers compile untouched. Slice-level rgb_f32_to_planes_f32
/ rgba_f32_to_planes_f32 / planes_f32_to_rgb_f32 / planes_f32_to_rgba_f32
now route through the chunk-level SIMD via a 16 -> 8 -> 4 -> scalar
tail loop in the impl_v3 / impl_neon / impl_wasm128 #[arcane] entries.

Tests: 12 byte-exact parity tests per arch (x86_64 V3, aarch64 NEON,
wasm32 SIMD128) plus an asm-inline-check harness that runs as a
runtime smoke test under each target. All 209 wasm32-wasip1 lib
tests pass under wasmtime, all 209 aarch64-unknown-linux-gnu lib tests
pass under qemu (cross), all 228 x86_64 lib tests pass natively.
@lilith lilith changed the title feat(deinterleave): add chunk-level f32 RGB/RGBA deinterleave + interleave feat(deinterleave): real SIMD f32 RGB/RGBA chunks (NEON ld3q + AVX2 / wasm128 shuffles) May 3, 2026
…shuffles

The 8 #[arcane] f32 RGB/RGBA slice-level dispatchers in PR #5 were
scalar stubs — the v3 / neon variants called the same loop body as
scalar, with only autovectorize as the SIMD path. Replace with real
hand-rolled per-arch chunk SIMD:

- aarch64 NEON: vld3q_f32 / vld4q_f32 / vst3q_f32 / vst4q_f32 hardware
  structure-load instructions, one per 4-pixel chunk. Confirmed
  ld3 { v0.4s, v1.4s, v2.4s } / st3 / st4 in cargo asm output for the
  slice-level rgb_f32_to_planes_f32 and friends.

- x86_64 AVX2: 128-bit shuffles using vshufps / vinsertps / vblendps /
  vbroadcastss / vpermilps for chunk-4 RGB; _MM_TRANSPOSE4_PS via
  vunpcklps / vunpckhps / vmovlhps / vmovhlps for chunk-4 RGBA. Larger
  chunk sizes (8, 16) compose chunk-4 calls for clean AVX2 codegen.
  cargo asm shows 56+ vshufps / vblendps / vinsertps / vbroadcastss
  in the inlined chunk-16 path with zero scalar f32 ops.

- wasm32 SIMD128: i8x16.shuffle (the wasm i32x4_shuffle macro lowers
  to i8x16.shuffle) for cross-lane 5-shuffle deinterleave. cargo asm
  shows 42 i8x16.shuffle / 33 v128.load / 24 v128.store in the
  slice-level body.

Each chunk-level SIMD function is #[rite] so it inlines into the
caller's #[arcane] / target_feature region with zero call/return
overhead — verified by tests/asm_inline_check.rs which asserts the
sample_caller_*_chunk16 body contains the chunk function's SIMD ops
inline (no call instruction).

Public API additions (purely additive — cargo semver-checks reports
no semver update required):

  rgb_f32_chunk{4,8,16}_to_planes_{scalar,v3,neon,wasm128}
  rgba_f32_chunk{4,8,16}_to_planes_{scalar,v3,neon,wasm128}
  planes_to_rgb_f32_chunk{4,8,16}_{scalar,v3,neon,wasm128}
  planes_to_rgba_f32_chunk{4,8,16}_{scalar,v3,neon,wasm128}

PR #5's no-suffix names are kept as thin wrappers around *_scalar so
existing callers compile untouched. Slice-level rgb_f32_to_planes_f32
/ rgba_f32_to_planes_f32 / planes_f32_to_rgb_f32 / planes_f32_to_rgba_f32
now route through the chunk-level SIMD via a 16 -> 8 -> 4 -> scalar
tail loop in the impl_v3 / impl_neon / impl_wasm128 #[arcane] entries.

Tests: 12 byte-exact parity tests per arch (x86_64 V3, aarch64 NEON,
wasm32 SIMD128) plus an asm-inline-check harness that runs as a
runtime smoke test under each target. All 209 wasm32-wasip1 lib
tests pass under wasmtime, all 209 aarch64-unknown-linux-gnu lib tests
pass under qemu (cross), all 228 x86_64 lib tests pass natively.
@lilith
Copy link
Copy Markdown
Member Author

lilith commented May 8, 2026

Superseded by #8 — same surface area but using tokenless #[rite(<tier>)] per the discussion in #7. Closing this in favor of #8.

@lilith lilith closed this May 8, 2026
lilith added a commit that referenced this pull request May 8, 2026
…gh autovec

Per benchmarks/deinterleave_autovec_vs_chunk_2026-05-07: the 36 hand-written
f32 chunk SIMD primitives in PR #5 used 128-bit XMM ops exclusively (75 ×
_mm_* vs 0 × _mm256_* in the f32 chunk module). The "_v3" naming was
misleading — VEX-encoded SSE-style code, not 256-bit AVX2.

LLVM autovec on the scalar loop body wrapped in #[arcane(v3)] beats the
hand-written 128-bit chunks by 26-37% at 1024px (the realistic image-row
size). Direct asm: autovec emits 20 YMM uses + 6 XMM; hand-written
emits 20 YMM + 67 XMM. The autovec path uses 256-bit registers; the
hand-written code can't because it was written with 128-bit intrinsics
throughout.

Changes:

  - Replace 12 f32 dispatcher impls with thin `#[arcane(<tier>)]` wrappers
    around the inline scalar loop. LLVM autovec lifts to 256-bit YMM
    under target_feature avx2,fma.
  - Delete `mod x86_f32_chunks`, `mod arm_f32_chunks`, `mod wasm_f32_chunks`
    (~1180 lines). 36 hand-written chunk SIMD primitives removed before
    they ever published.
  - Delete the 3 `pub use *_f32_chunks::{...}` re-export blocks.
  - Delete tests/asm_inline_check.rs (the chunks it inline-checked are
    gone).
  - Delete 36 in-file per-arch parity tests in `mod tests`.
  - Strip 2 chunk-related bench groups from benches/deinterleave.rs;
    the chunk-size-choice and autovec-vs-chunk groups served their
    purpose (proved this refactor was right) and are now dead.

What stays:

  - 12 pure-scalar f32 chunk primitives (`*_chunk{4,8,16}_to_planes_scalar`
    and `planes_to_*_chunk{4,8,16}_scalar`). Useful for callers inside
    their own `#[arcane(<tier>)]` region — autovec lifts them to YMM.
  - 4 slice-level f32 dispatchers (the public API).
  - 2 u8/u16 tokenless chunk fns (`rgb24_chunk8_to_planes_tokenless_v3`,
    `rgb48_chunk8_to_planes_tokenless_v3`). These DO use 256-bit AVX2
    (`_mm256_cvtepu8_epi32` widening + `_mm256_storeu_ps`). Caller: zenanalyze.

Caller migration:

  zenpixels-convert/fast_gamut_v2.rs (only caller of PR #5's f32 chunks):
  rename `*_chunk{4,8}_to_planes_tokenless_{v3,neon,wasm128}` → `*_chunk{4,8}_to_planes_scalar`.
  Caller is already inside `#[arcane(<tier>)]` so the scalar body
  autovecs to 256-bit YMM.

Verified:
  - x86_64 native: 203 tests pass (was 216 — diff is 36 deleted parity
    tests + 12 lost from asm_inline_check, vs 35 net counted)
  - aarch64 cross + qemu: 201 tests pass
  - wasm32-wasip1 + wasmtime + simd128: 201 tests pass
  - clippy clean
  - zenpixels-convert with garb 0.2.8 (path-patched): 259 lib tests pass

File-level diff:
  src/deinterleave.rs     -1180 lines (chunk modules) +12 dispatcher trims
  tests/asm_inline_check.rs  deleted (-277 lines)
  benches/deinterleave.rs  -451 lines (chunk-size + autovec-vs-chunk groups)
  zenpixels-convert/src/fast_gamut_v2.rs  12 call sites renamed

Tracking: #7
lilith added a commit that referenced this pull request May 8, 2026
The previous section described PR #5's never-shipped chunk-level
`*_chunk8_to_planes_v3(_t: X64V3Token, ...)` API and called the f32
chunks "hand-tuned AVX2". After bench-driven trimming both claims are
wrong. Replace with the actual v0.2.8 surface:

  - u8 / u16 paths: hand-written `_mm_shuffle_epi8` + 256-bit AVX2
    widening, 21-52% faster than autovec (kept).
  - f32 paths: `#[autoversion]` over scalar, autovec to 256-bit YMM /
    NEON / wasm128 (drop the 128-bit hand-written chunks; they lost
    to autovec by 26-37% at 1024px).
  - Chunk-level public surface: u8 / u16 `_tokenless_v3` for SIMD
    fusion, plus `_scalar` chunks for both u8/u16 and f32 callers
    inside their own #[arcane(<tier>)] region.

Cross-references the saved bench files for anyone wanting to see the
data behind the design choices.
lilith added a commit that referenced this pull request May 8, 2026
…-break v0.2.8 (#8)

* feat(deinterleave): add chunk-level f32 RGB/RGBA deinterleave + interleave

Mirrors the existing rgb24_chunk8_to_planes_scalar pattern at f32 input
and across {4, 8, 16}-pixel chunk widths × {RGB, RGBA} × {deinterleave,
interleave} = 12 new public functions.

Each body is a single fixed-array literal expression; LLVM auto-vectorizes
the contiguous loads into vinsertps/vshufps (x86) or tbl (AArch64)
shuffles per the "Fixed-array scalar loads/stores can auto-vectorize into
shuffles" pattern. No SIMD intrinsics, no unsafe code.

Use case: enables fused per-chunk TRC + 3×3 matrix kernels in
zenpixels-convert without manual deinterleave/interleave loops.

* feat(deinterleave): real SIMD f32 RGB/RGBA via NEON ld3q + AVX2/WASM shuffles

The 8 #[arcane] f32 RGB/RGBA slice-level dispatchers in PR #5 were
scalar stubs — the v3 / neon variants called the same loop body as
scalar, with only autovectorize as the SIMD path. Replace with real
hand-rolled per-arch chunk SIMD:

- aarch64 NEON: vld3q_f32 / vld4q_f32 / vst3q_f32 / vst4q_f32 hardware
  structure-load instructions, one per 4-pixel chunk. Confirmed
  ld3 { v0.4s, v1.4s, v2.4s } / st3 / st4 in cargo asm output for the
  slice-level rgb_f32_to_planes_f32 and friends.

- x86_64 AVX2: 128-bit shuffles using vshufps / vinsertps / vblendps /
  vbroadcastss / vpermilps for chunk-4 RGB; _MM_TRANSPOSE4_PS via
  vunpcklps / vunpckhps / vmovlhps / vmovhlps for chunk-4 RGBA. Larger
  chunk sizes (8, 16) compose chunk-4 calls for clean AVX2 codegen.
  cargo asm shows 56+ vshufps / vblendps / vinsertps / vbroadcastss
  in the inlined chunk-16 path with zero scalar f32 ops.

- wasm32 SIMD128: i8x16.shuffle (the wasm i32x4_shuffle macro lowers
  to i8x16.shuffle) for cross-lane 5-shuffle deinterleave. cargo asm
  shows 42 i8x16.shuffle / 33 v128.load / 24 v128.store in the
  slice-level body.

Each chunk-level SIMD function is #[rite] so it inlines into the
caller's #[arcane] / target_feature region with zero call/return
overhead — verified by tests/asm_inline_check.rs which asserts the
sample_caller_*_chunk16 body contains the chunk function's SIMD ops
inline (no call instruction).

Public API additions (purely additive — cargo semver-checks reports
no semver update required):

  rgb_f32_chunk{4,8,16}_to_planes_{scalar,v3,neon,wasm128}
  rgba_f32_chunk{4,8,16}_to_planes_{scalar,v3,neon,wasm128}
  planes_to_rgb_f32_chunk{4,8,16}_{scalar,v3,neon,wasm128}
  planes_to_rgba_f32_chunk{4,8,16}_{scalar,v3,neon,wasm128}

PR #5's no-suffix names are kept as thin wrappers around *_scalar so
existing callers compile untouched. Slice-level rgb_f32_to_planes_f32
/ rgba_f32_to_planes_f32 / planes_f32_to_rgb_f32 / planes_f32_to_rgba_f32
now route through the chunk-level SIMD via a 16 -> 8 -> 4 -> scalar
tail loop in the impl_v3 / impl_neon / impl_wasm128 #[arcane] entries.

Tests: 12 byte-exact parity tests per arch (x86_64 V3, aarch64 NEON,
wasm32 SIMD128) plus an asm-inline-check harness that runs as a
runtime smoke test under each target. All 209 wasm32-wasip1 lib
tests pass under wasmtime, all 209 aarch64-unknown-linux-gnu lib tests
pass under qemu (cross), all 228 x86_64 lib tests pass natively.

* wip(deinterleave): partial rename of NEON chunk fns to _tokenless_<tier>

Found in `~/work/garb--cold-pr5` worktree on `feat/tokenless-rite-deinterleave`
branch as uncommitted edits when picking up to draft v0.2.8. The diff renames
7 NEON chunk fns (chunk4 × 4 + chunk8 × 3) from `_neon(_t: NeonToken, ...)`
form to `#[rite(neon)] ..._tokenless_neon(...)` form — exactly the convention
the v0.2.8 tracking issue (#7) prescribes for moving archmage Token types out
of garb's public signatures.

Preserving as-is per CLAUDE.md "NEVER DESTROY UNCOMMITTED WORK". Subsequent
commits on this branch finish the rename across the remaining v3 / wasm128 /
chunk16 / NEON-chunk8-RGBA-interleave fns plus the old u8/u16 _v3 chunks.

Note: this partial state does NOT compile by itself — the internal callers
in `arm_f32::*_impl_neon` still call the old names. Subsequent commits fix
that as part of finishing the rename.

* refactor(deinterleave): drop 12 bare-name f32 chunk aliases

PR #5 shipped one-line forwarders rgb_f32_chunk{4,8,16}_to_planes etc. that
delegated to the matching _scalar variant. They had zero callers anywhere in
the zen workspace (verified via rg) and the naming misled readers into
thinking they were runtime dispatchers when they were just scalar thunks.

Drop them. Tests that called the bare names now call _scalar directly. The
section's leading comment block updated to document the new convention
(_scalar + _tokenless_<tier>) and explain why the bare _v3/_neon/_wasm128
suffix is reserved for the token-accepting form.

Tracking: #7

* feat(deinterleave): drop archmage Token types from public API (BREAKING)

Replace 36 token-taking f32 chunk fns + 2 u8/u16 token-taking chunks with
tokenless `#[rite(<tier>)]` equivalents named `_tokenless_<tier>`. Garb's
public surface is now archmage-token-free.

Public API change in `garb::deinterleave::`:

  REMOVED:
    rgb24_chunk8_to_planes_v3(_t: X64V3Token, ...)
    rgb48_chunk8_to_planes_v3(_t: X64V3Token, ...)

  REPLACED WITH:
    rgb24_chunk8_to_planes_tokenless_v3(...)         #[rite(v3)]
    rgb48_chunk8_to_planes_tokenless_v3(...)         #[rite(v3)]

  ADDED (PR #5 surface, no archmage types):
    {rgb,rgba}_f32_chunk{4,8,16}_to_planes_tokenless_{v3,neon,wasm128}
    planes_to_{rgb,rgba}_f32_chunk{4,8,16}_tokenless_{v3,neon,wasm128}
    {rgb,rgba}_f32_chunk{4,8,16}_to_planes_scalar
    planes_to_{rgb,rgba}_f32_chunk{4,8,16}_scalar
    {rgb,rgba}_f32_to_planes_f32 / planes_f32_to_{rgb,rgba}_f32  (slice-level)

Why tokenless: archmage's tier-based `#[rite(<tier>)]` form adds
`#[target_feature]` + `#[inline]` + `#[cfg(target_arch)]` automatically.
Functions decorated this way are safe to call from any matching `#[arcane]`
or `#[rite]` region (Rust 1.86+ relaxation), inline cleanly into the
caller's SIMD body, and have no archmage type in their signature. This
decouples garb's semver from archmage's. Today's downstream callers
(zenanalyze, zenpixels-convert/fast-gamut-refactor, zenpipe--garb-deinterleave/
zenfilters) all already directly depend on archmage themselves, so the
removal of tokens from garb's signatures imposes zero new import burden —
they just stop passing the token they already have.

Why "_tokenless_" infix: the bare `_v3` / `_neon` / `_wasm128` suffix is
reserved (by the archmage convention used elsewhere in the workspace) for
token-accepting forms. The new infix avoids ambiguity. The archmage Token
types are still used internally by `*_impl_<tier>` dispatcher fns that
take a token from `incant!`, but those stay private.

Bench impact: zero. The `#[rite(<tier>)]` swap and parameter rename are
zero-LOC-delta in lowered code; same SIMD body, same inlining behavior.

Test impact: 36 in-file parity tests (`*_v3_matches_scalar` etc.) and the
tests/asm_inline_check.rs harness updated to use tokenless names. All
216 tests pass on x86_64; aarch64 + wasm32 verified via cross/wasmtime
in subsequent commits.

Tracking: #7

* fix(deinterleave): rename u8/u16 chunk pub-use aliases to _tokenless_v3

Pass-3 of the rename script renamed the source side of the re-export
(`x86::rgb24_chunk8_tokenless_v3`) but missed the alias side, leaving
the external name as `rgb24_chunk8_to_planes_v3` — same as the v0.2.7
token-taking name, so external callers wouldn't find the new tokenless
fn at the documented path. Renamed both u8 and u16 aliases to
`*_tokenless_v3` for consistency with the f32 surface.

Caught by zenanalyze tier1.rs migration test build:
  error[E0425]: cannot find function `rgb24_chunk8_to_planes_tokenless_v3`
  in module `garb::deinterleave`

All 216 garb tests still pass after the rename.

* bench(deinterleave): chunk-size A/B for slice dispatcher pipeline

Adds a `rgb_f32 chunk size choice` group to `benches/deinterleave.rs`
comparing 4 dispatcher variants — chunk16+tail, chunk8+tail, chunk4+tail,
and the current chunk16 → chunk8 → chunk4 → scalar cascade — across
sizes whose tails do and don't divide 16 cleanly.

Findings (7950X, AVX2):

  | size               | chunk16  | chunk8  | chunk4  | cascade |
  | 16px (clean)       | 21.4ns   | 23.6ns  | 21.3ns  | 20.7ns  |
  | 17px (+1 tail)     | 24.1ns   | 22.7ns  | 22.2ns  | 22.3ns  |
  | 31px (+15 tail)    | 31.2ns   | 28.3ns  | 26.2ns  | 25.9ns  |
  | 1024px (clean L1)  | 184.2ns  | 178.5ns | 214.6ns | 197.9ns |
  | 4099px (+3 tail)   | 678.2ns  | 679.9ns | 845.3ns | 738.7ns |
  | 65536px (L2)       | 13.02µs  | 13.16µs | 13.72µs | 12.89µs |

At realistic image-row sizes (1024-4099 px), chunk16+tail ties or beats
the cascade. Cascade only wins at sub-32px awkward-tail sizes that don't
matter in production. chunk4-only loses consistently at 1024+ px.

Action item for the v0.2.8 PR: drop chunk8/chunk4 from the slice
dispatcher pipeline — chunk16 + flat scalar tail across the 12 impl
bodies. Saves ~140 LOC, no perf regression at sizes that matter.

This bench also informally answers the by-value-vs-mut-ref output
question: cargo asm on the current dispatcher shows 14 direct `vmovups`
writes to output slices and 1 stack write (a register spill, not a
chunk-fn local). LLVM has already elided the chunk fn's tuple-return
through inlining; the mut-ref refactor would change zero asm.

Tracking: #7

* refactor(deinterleave): chunk16 + flat scalar tail in slice dispatchers

The 16→8→4→scalar cascade in the 12 f32 slice-dispatcher impl bodies didn't
earn its keep over a simpler chunk16 + scalar-tail pipeline. From the
chunk-size A/B benchmark on this branch (7950X, AVX2):

  | size              | chunk16+tail | cascade 16-8-4 | delta |
  | 16px              |    21.4ns    |     20.7ns     |  -3%  |
  | 17px (+1 tail)    |    24.1ns    |     22.3ns     |  -8%  |
  | 31px (+15 tail)   |    31.2ns    |     25.9ns     | -17%  |
  | 1024px (clean)    |   184.2ns    |    197.9ns     |  +7%  |
  | 4099px (+3 tail)  |   678.2ns    |    738.7ns     |  +9%  |
  | 65536px (L2)      |  13.02µs     |   12.89µs      |  -1%  |

The cascade only wins at sub-32px sizes that aren't representative of
production image-row work (where rows are hundreds to thousands of
pixels). At 1024-4099px (the realistic range), chunk16+tail is 7-9%
faster — its single-loop structure reduces branch overhead and makes the
hot path easier for LLVM to schedule.

Each `*_impl_<tier>` body simplifies from a 4-stage cascade (chunk16,
chunk8, chunk4, scalar) to 1 chunk16 loop + 1 flat scalar tail (up to 15
pixels). The chunk8 / chunk4 public fns stay — `zenanalyze/tier1.rs`
calls `rgb24_chunk8_to_planes_v3` (now `_tokenless_v3`) and
`zenpixels-convert/fast_gamut_v2.rs` calls chunk8 v3 + chunk4 neon/wasm128
directly under their own `#[arcane]` regions. They aren't part of the
slice-dispatcher path.

Touches all 12 dispatcher impls — 4 shapes × 3 tiers (v3/neon/wasm128).
Net diff: 12 fns × ~50 lines → ~25 lines each, cuts roughly 300 LOC.

Test plan:
  - x86_64 native: 216 tests pass
  - aarch64 cross + qemu: 214 tests pass
  - wasm32-wasip1 + wasmtime + simd128: 214 tests pass

Tracking: #7

* bench(deinterleave): autovec(avx2) vs hand-written chunk SIMD

Adds a `rgb_f32 autovec vs chunk SIMD` group to benches/deinterleave.rs
comparing the current chunk-SIMD slice dispatcher path
(`rgb_f32_to_planes_f32` → `*_chunk*_to_planes_tokenless_v3` → 128-bit
`_mm_*` intrinsics) vs LLVM autovec on the same scalar source wrapped
in `#[arcane(v3)]` (which lets LLVM use 256-bit YMM ops under
target_feature avx2,fma).

Findings (7950X, AVX2):

  | size       | scatter chunk-SIMD | scatter autovec  | autovec wins |
  | 64px       |   29.7ns           |    31.0ns        |     -4%      |
  | 256px      |   63.3ns           |    48.8ns        |    +30%      |
  | 1024px     |  178.2ns           |   130.5ns        |    +37%      |
  | 4096px     |  719.1ns           |   703.9ns        |     +2%      |
  | 16384px    | 2691.3ns           |  2857.3ns        |     -6%      |
  | 65536px    | 12.69µs            |  13.03µs         |     -3%      |

  | size       | gather chunk-SIMD  | gather autovec   | autovec wins |
  | 64px       |   29.7ns           |    27.4ns        |     +8%      |
  | 1024px     |  195.8ns           |   155.2ns        |    +26%      |
  | 262144px   |   51.6µs           |    47.8µs        |     +8%      |

At 1024px (the realistic image-row size), autovec wins by 26-37%. At
larger sizes (16K+) they're within ~5%, memory-bandwidth-dominated.
At smaller sizes the picture is mixed but autovec mostly wins.

Root cause: `mod x86_f32_chunks` uses 75 × `_mm_*` (128-bit XMM) ops
and 0 × `_mm256_*` (256-bit YMM) ops. The "v3" naming on the f32
chunks is misleading — it's VEX-encoded SSE-style code, not real
256-bit AVX2.

Direct asm-level confirmation:

  path                                                  YMM uses   XMM uses
  --------------------------------------------------------------------------
  autovec (scalar under #[arcane(v3)])                    20          6
  chunk-SIMD (__arcane_rgb_f32_to_planes_impl_v3)         20         67

LLVM autovec emits 3.3× higher YMM-to-XMM ratio than the hand-written
chunks. The hand-written code can't use 256-bit registers because it
was written with 128-bit intrinsics throughout.

Implication: the hand-written f32 chunk SIMD ships a net-negative
implementation vs autovec at the size range that matters most for
image processing (1024px scatter: autovec is 37% faster). Either:

  A. Drop hand-written f32 chunk SIMD from the slice dispatcher path;
     route through #[arcane(v3)] + scalar source (autovec).
  B. Rewrite chunk SIMD using actual 256-bit `_mm256_*` ops
     (`_mm256_loadu_ps`, `_mm256_permutevar8x32_ps`, ...). Separate
     engineering project.

Public chunk fns (chunk4/chunk8/chunk16) stay regardless for callers
that want to fuse them into their own SIMD loops (`zenanalyze/tier1.rs`
and `zenpixels-convert/fast_gamut_v2.rs`), but they shouldn't be the
path the slice dispatcher takes.

Tracking: #7

* refactor(deinterleave): drop hand-written f32 chunk SIMD; route through autovec

Per benchmarks/deinterleave_autovec_vs_chunk_2026-05-07: the 36 hand-written
f32 chunk SIMD primitives in PR #5 used 128-bit XMM ops exclusively (75 ×
_mm_* vs 0 × _mm256_* in the f32 chunk module). The "_v3" naming was
misleading — VEX-encoded SSE-style code, not 256-bit AVX2.

LLVM autovec on the scalar loop body wrapped in #[arcane(v3)] beats the
hand-written 128-bit chunks by 26-37% at 1024px (the realistic image-row
size). Direct asm: autovec emits 20 YMM uses + 6 XMM; hand-written
emits 20 YMM + 67 XMM. The autovec path uses 256-bit registers; the
hand-written code can't because it was written with 128-bit intrinsics
throughout.

Changes:

  - Replace 12 f32 dispatcher impls with thin `#[arcane(<tier>)]` wrappers
    around the inline scalar loop. LLVM autovec lifts to 256-bit YMM
    under target_feature avx2,fma.
  - Delete `mod x86_f32_chunks`, `mod arm_f32_chunks`, `mod wasm_f32_chunks`
    (~1180 lines). 36 hand-written chunk SIMD primitives removed before
    they ever published.
  - Delete the 3 `pub use *_f32_chunks::{...}` re-export blocks.
  - Delete tests/asm_inline_check.rs (the chunks it inline-checked are
    gone).
  - Delete 36 in-file per-arch parity tests in `mod tests`.
  - Strip 2 chunk-related bench groups from benches/deinterleave.rs;
    the chunk-size-choice and autovec-vs-chunk groups served their
    purpose (proved this refactor was right) and are now dead.

What stays:

  - 12 pure-scalar f32 chunk primitives (`*_chunk{4,8,16}_to_planes_scalar`
    and `planes_to_*_chunk{4,8,16}_scalar`). Useful for callers inside
    their own `#[arcane(<tier>)]` region — autovec lifts them to YMM.
  - 4 slice-level f32 dispatchers (the public API).
  - 2 u8/u16 tokenless chunk fns (`rgb24_chunk8_to_planes_tokenless_v3`,
    `rgb48_chunk8_to_planes_tokenless_v3`). These DO use 256-bit AVX2
    (`_mm256_cvtepu8_epi32` widening + `_mm256_storeu_ps`). Caller: zenanalyze.

Caller migration:

  zenpixels-convert/fast_gamut_v2.rs (only caller of PR #5's f32 chunks):
  rename `*_chunk{4,8}_to_planes_tokenless_{v3,neon,wasm128}` → `*_chunk{4,8}_to_planes_scalar`.
  Caller is already inside `#[arcane(<tier>)]` so the scalar body
  autovecs to 256-bit YMM.

Verified:
  - x86_64 native: 203 tests pass (was 216 — diff is 36 deleted parity
    tests + 12 lost from asm_inline_check, vs 35 net counted)
  - aarch64 cross + qemu: 201 tests pass
  - wasm32-wasip1 + wasmtime + simd128: 201 tests pass
  - clippy clean
  - zenpixels-convert with garb 0.2.8 (path-patched): 259 lib tests pass

File-level diff:
  src/deinterleave.rs     -1180 lines (chunk modules) +12 dispatcher trims
  tests/asm_inline_check.rs  deleted (-277 lines)
  benches/deinterleave.rs  -451 lines (chunk-size + autovec-vs-chunk groups)
  zenpixels-convert/src/fast_gamut_v2.rs  12 call sites renamed

Tracking: #7

* refactor(deinterleave): use #[autoversion] for f32 dispatchers

Replace the manual `incant!` + `mod x86_f32 / arm_f32 / wasm_f32` +
`pub(crate) fn ..._impl_<tier>` boilerplate with a single
`#[autoversion(v3, neon, wasm128)]` attribute on each f32 public dispatcher.

Before: 4 public fns + 4 `_impl_scalar` + 12 per-arch `_impl_<tier>`
        wrappers across 3 mods + 3 `use` blocks + 4 `incant!` calls
        = ~240 lines of dispatcher boilerplate.

After:  4 public fns each decorated with `#[autoversion(v3, neon, wasm128)]`,
        body is the error checks + a direct call to the inline scalar loop
        helper. archmage's autoversion macro generates the per-tier variants
        + runtime dispatcher.

The autoversion-generated code is asm-identical to the previous manual
wiring on x86_64 (20 YMM uses + 6 XMM in `__arcane_*_v3`, same as before).
Slice dispatchers still autovec to 256-bit YMM under target_feature
avx2,fma — the simplification is purely cosmetic / DRY.

The u8/u16 path (`rgb24_to_planes_f32`, `rgb48_to_planes_f32`) keeps its
manual `incant!` + chunk-SIMD wiring because the hand-written
`_mm_shuffle_epi8` chunk fns beat autovec by 21-52% at L1-L3 sizes
(see `benchmarks/{rgb24,rgb48}_chunk_vs_autovec_2026-05-07`). Different
shape from f32 — `_mm_shuffle_epi8` patterns autovec can't infer +
real 256-bit AVX2 widening.

Saved benches:
  benchmarks/rgb24_chunk_vs_autovec_2026-05-07.{log,meta} — chunk SIMD
    +21% to +52% at all non-memory-bound sizes
  benchmarks/rgb48_chunk_vs_autovec_2026-05-07.{log,meta} — chunk SIMD
    +22% to +49%

src/deinterleave.rs: 1838 → 1587 lines (-251).

Verified:
  - x86_64: 203 tests pass
  - aarch64: 201 tests pass
  - wasm32-wasip1: 201 tests pass
  - clippy clean

Tracking: #7

* fix(deinterleave): drop redundant + mis-cfg-gated archmage imports

The previous commit's import cleanup got mangled by `cargo fmt` reordering:
the `#[cfg(target_arch = "x86_64")]` attribute on `use archmage::X64V3Token`
ended up also gating `use archmage::autoversion`, even though autoversion
is a proc-macro that needs to be available on all archs (it cfg-gates its
generated tier variants internally).

Tests still passed because `archmage::prelude::*` re-exports both
`autoversion` and `incant` — the explicit `use` statements were
redundant. Drop them; rely on `prelude::*`. Keep only the gated
`X64V3Token` import (used by the existing `autovec_avx2_rgb24` u8 helper
in the file body).

Net delete: 2 lines.

* docs(README): update deinterleave section for v0.2.8 surface

The previous section described PR #5's never-shipped chunk-level
`*_chunk8_to_planes_v3(_t: X64V3Token, ...)` API and called the f32
chunks "hand-tuned AVX2". After bench-driven trimming both claims are
wrong. Replace with the actual v0.2.8 surface:

  - u8 / u16 paths: hand-written `_mm_shuffle_epi8` + 256-bit AVX2
    widening, 21-52% faster than autovec (kept).
  - f32 paths: `#[autoversion]` over scalar, autovec to 256-bit YMM /
    NEON / wasm128 (drop the 128-bit hand-written chunks; they lost
    to autovec by 26-37% at 1024px).
  - Chunk-level public surface: u8 / u16 `_tokenless_v3` for SIMD
    fusion, plus `_scalar` chunks for both u8/u16 and f32 callers
    inside their own #[arcane(<tier>)] region.

Cross-references the saved bench files for anyone wanting to see the
data behind the design choices.
lilith added a commit that referenced this pull request May 8, 2026
The previous entry talked extensively about PR #5's never-published
hand-written f32 chunk SIMD ("Not added (PR #5's...)") and the
incant!→autoversion internal refactor as if either were user-visible
API changes. They aren't — users only see deltas relative to released
versions.

Tightened to:
  - Removed (BREAKING): the 2 token-taking _v3 chunk fns
  - Added: 2 tokenless _v3 replacements + 12 scalar f32 chunk primitives
  - Changed (internal): incant!→autoversion in the existing slice
    dispatchers, no public API change
  - Migration: 1-line caller rename

Bench evidence and PR-#5-supersession history live in
benchmarks/*.meta and the GitHub issue/PR thread; not in CHANGELOG.
@lilith lilith deleted the deinterleave-f32-chunks branch May 9, 2026 23:21
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant