feat(deinterleave): real SIMD f32 RGB/RGBA chunks (NEON ld3q + AVX2 / wasm128 shuffles) by lilith · Pull Request #5 · imazen/garb

lilith · 2026-05-02T23:04:55Z

Summary

This PR (re-titled and expanded) lands two layers of work on src/deinterleave.rs:

(Original PR feat(deinterleave): real SIMD f32 RGB/RGBA chunks (NEON ld3q + AVX2 / wasm128 shuffles) #5) 12 new public scalar chunk-level f32 RGB/RGBA deinterleave + interleave functions at chunk widths {4, 8, 16}.
(New) Per-arch SIMD specializations of those chunks — real vld3q_f32 / vld4q_f32 / vst3q_f32 / vst4q_f32 on AArch64; real vshufps / vinsertps / vblendps / vunpcklps / vmovlhps on x86_64 AVX2; real i8x16.shuffle / v128.load / v128.store on wasm32 SIMD128. The slice-level rgb_f32_to_planes_f32 / rgba_f32_to_planes_f32 / planes_f32_to_rgb_f32 / planes_f32_to_rgba_f32 #[arcane] dispatchers (which previously called the scalar loop body verbatim under a target_feature = "avx2" / "neon" region) now route through these chunk SIMD primitives via a 16 → 8 → 4 → scalar tail pipeline.

Public API

Twelve new shape-keys per chunk variant — chunk-{4, 8, 16} × {RGB, RGBA} × {deinterleave, interleave}:

{rgb,rgba}_f32_chunk{4,8,16}_to_planes_{scalar,v3,neon,wasm128}   (24 deinterleavers)
planes_to_{rgb,rgba}_f32_chunk{4,8,16}_{scalar,v3,neon,wasm128}   (24 interleavers)

The bare-name aliases from the original PR #5 (rgb_f32_chunk4_to_planes etc.) are preserved as thin wrappers around *_scalar so existing callers compile untouched.

cargo semver-checks check-release --all-features reports Summary no semver update required — purely additive.

Codegen verification

#[rite] on every per-arch chunk function so they fuse into the caller's #[arcane] / target_feature region — no call / b instruction at the chunk boundary. Verified by tests/asm_inline_check.rs which dumps a sample_caller_*_chunk16 body and asserts the chunk function's SIMD ops appear inline.

x86_64 AVX2 — `sample_caller_v3_rgb_chunk16` (chunk-16 RGB deinterleave, fully inlined)

asm_inline_check::x86_inline::__arcane_sample_caller_v3_rgb_chunk16:
    vmovups xmm2, xmmword ptr [rsi]
    vmovups xmm3, xmmword ptr [rsi + 16]
    vmovups xmm4, xmmword ptr [rsi + 48]
    vmovups xmm5, xmmword ptr [rsi + 64]
    vinsertps xmm0, xmm3, xmm2, 140
    vshufps xmm0, xmm0, xmmword ptr [rsi + 32], 196
    vblendps xmm1, xmm3, xmm2, 2
    vshufps xmm1, xmm1, xmm1, 241
    vbroadcastss xmm6, dword ptr [rsi + 40]
    vblendps xmm1, xmm1, xmm6, 8
    vshufps xmm3, xmm2, xmm3, 236
    ...  (4× chunk-4 sub-bodies, 56 SIMD ops total)
    vmovups xmmword ptr [rdi], xmm6
    vmovups xmmword ptr [rdi + 16], xmm4
    ...  (12× 16-byte stores)
    ret

56 SIMD ops in the body (vmovups / vshufps / vblendps / vinsertps / vbroadcastss).
0 call instructions to any *_chunk*_v3 symbol — fully inlined.
0 scalar vmovss / movss loads.

aarch64 NEON — `garb::deinterleave::rgb_f32_to_planes_f32` (slice-level dispatch into chunk-16 NEON)

.LBB8_8:
    cmp x8, x1
    b.hi .LBB8_30
    mov x11, x9
    sub x16, x9, #96
    add x8, x8, #48
    ld3 { v0.4s, v1.4s, v2.4s }, [x11], #48
    ld3 { v3.4s, v4.4s, v5.4s }, [x16]
    ld3 { v19.4s, v20.4s, v21.4s }, [x11]
    ld3 { v16.4s, v17.4s, v18.4s }, [x16]
    ...
    stp q3, q16, [x13, #-32]
    stp q0, q19, [x13], #64
    ...
    b.ls .LBB8_8

4× ld3 { v0.4s, v1.4s, v2.4s } per inner-loop iteration (= unrolled 2 × chunk-16) — exactly the hardware structure-load.
0 scalar ldr s0 loads in the inner loop.
The RGBA path lights up ld4 { v0.4s, v1.4s, v2.4s, v3.4s } instead — 8 in the chunk-16 unrolled body.
The interleave path emits st3 / st4 symmetrically (8 each in the unrolled chunk-16 inner body).

wasm32 SIMD128 — `garb::deinterleave::rgb_f32_to_planes_f32` (slice-level dispatch into chunk-16 wasm128)

WAT histogram across the full slice-level body (built with RUSTFLAGS="-C target-feature=+simd128"):

   42  i8x16.shuffle    ← cross-lane permute (i32x4_shuffle macro lowers to i8x16.shuffle)
   33  v128.load
   24  v128.store
    3  f32.load          ← scalar tail (< 4 pixels)
    3  f32.store         ← scalar tail

Test plan

cargo test --release --features experimental (x86_64) — 228 lib tests pass, including 12 byte-exact *_v3 vs *_scalar parity tests for the new SIMD chunks.
cross test --release --features experimental --target aarch64-unknown-linux-gnu (qemu) — 209 lib tests pass, including 12 byte-exact *_neon vs *_scalar parity tests.
RUSTFLAGS="-C target-feature=+simd128" cargo test --target wasm32-wasip1 --features experimental --release (wasmtime) — 209 lib tests pass, including 12 byte-exact *_wasm128 vs *_scalar parity tests.
tests/asm_inline_check.rs — runs as a smoke test on each arch when the host has the required ISA; verifies the #[rite] chunk functions inline cleanly into a sample #[arcane] caller. Passes on x86_64 native + aarch64 cross.
cargo semver-checks check-release --all-features — Summary no semver update required (purely additive).
cargo clippy --release --all-features --lib --tests — clean.

Existing tests retained (round-trip / slice-API-equivalence / order-preservation) — they now exercise the SIMD path on each arch instead of bare scalar.

Notes

No version bump in Cargo.toml; release cadence is the maintainer's call.
The wasm32 slice-level *_impl_wasm128 route is wired up via incant!(...[v3, neon, wasm128, scalar]) for symmetry; existing slice-level callers benefit on wasm32 with +simd128.
The chunk-level _scalar variants stay public so callers in non-SIMD regions (or with their own dispatch) can use them without going through the slice-level dispatcher.
Integrating into zenpixels-convert's narrow body (the original use case) is a separate follow-up.

…leave Mirrors the existing rgb24_chunk8_to_planes_scalar pattern at f32 input and across {4, 8, 16}-pixel chunk widths × {RGB, RGBA} × {deinterleave, interleave} = 12 new public functions. Each body is a single fixed-array literal expression; LLVM auto-vectorizes the contiguous loads into vinsertps/vshufps (x86) or tbl (AArch64) shuffles per the "Fixed-array scalar loads/stores can auto-vectorize into shuffles" pattern. No SIMD intrinsics, no unsafe code. Use case: enables fused per-chunk TRC + 3×3 matrix kernels in zenpixels-convert without manual deinterleave/interleave loops.

…shuffles The 8 #[arcane] f32 RGB/RGBA slice-level dispatchers in PR #5 were scalar stubs — the v3 / neon variants called the same loop body as scalar, with only autovectorize as the SIMD path. Replace with real hand-rolled per-arch chunk SIMD: - aarch64 NEON: vld3q_f32 / vld4q_f32 / vst3q_f32 / vst4q_f32 hardware structure-load instructions, one per 4-pixel chunk. Confirmed ld3 { v0.4s, v1.4s, v2.4s } / st3 / st4 in cargo asm output for the slice-level rgb_f32_to_planes_f32 and friends. - x86_64 AVX2: 128-bit shuffles using vshufps / vinsertps / vblendps / vbroadcastss / vpermilps for chunk-4 RGB; _MM_TRANSPOSE4_PS via vunpcklps / vunpckhps / vmovlhps / vmovhlps for chunk-4 RGBA. Larger chunk sizes (8, 16) compose chunk-4 calls for clean AVX2 codegen. cargo asm shows 56+ vshufps / vblendps / vinsertps / vbroadcastss in the inlined chunk-16 path with zero scalar f32 ops. - wasm32 SIMD128: i8x16.shuffle (the wasm i32x4_shuffle macro lowers to i8x16.shuffle) for cross-lane 5-shuffle deinterleave. cargo asm shows 42 i8x16.shuffle / 33 v128.load / 24 v128.store in the slice-level body. Each chunk-level SIMD function is #[rite] so it inlines into the caller's #[arcane] / target_feature region with zero call/return overhead — verified by tests/asm_inline_check.rs which asserts the sample_caller_*_chunk16 body contains the chunk function's SIMD ops inline (no call instruction). Public API additions (purely additive — cargo semver-checks reports no semver update required): rgb_f32_chunk{4,8,16}_to_planes_{scalar,v3,neon,wasm128} rgba_f32_chunk{4,8,16}_to_planes_{scalar,v3,neon,wasm128} planes_to_rgb_f32_chunk{4,8,16}_{scalar,v3,neon,wasm128} planes_to_rgba_f32_chunk{4,8,16}_{scalar,v3,neon,wasm128} PR #5's no-suffix names are kept as thin wrappers around *_scalar so existing callers compile untouched. Slice-level rgb_f32_to_planes_f32 / rgba_f32_to_planes_f32 / planes_f32_to_rgb_f32 / planes_f32_to_rgba_f32 now route through the chunk-level SIMD via a 16 -> 8 -> 4 -> scalar tail loop in the impl_v3 / impl_neon / impl_wasm128 #[arcane] entries. Tests: 12 byte-exact parity tests per arch (x86_64 V3, aarch64 NEON, wasm32 SIMD128) plus an asm-inline-check harness that runs as a runtime smoke test under each target. All 209 wasm32-wasip1 lib tests pass under wasmtime, all 209 aarch64-unknown-linux-gnu lib tests pass under qemu (cross), all 228 x86_64 lib tests pass natively.

lilith · 2026-05-08T04:21:25Z

Superseded by #8 — same surface area but using tokenless #[rite(<tier>)] per the discussion in #7. Closing this in favor of #8.

…gh autovec Per benchmarks/deinterleave_autovec_vs_chunk_2026-05-07: the 36 hand-written f32 chunk SIMD primitives in PR #5 used 128-bit XMM ops exclusively (75 × _mm_* vs 0 × _mm256_* in the f32 chunk module). The "_v3" naming was misleading — VEX-encoded SSE-style code, not 256-bit AVX2. LLVM autovec on the scalar loop body wrapped in #[arcane(v3)] beats the hand-written 128-bit chunks by 26-37% at 1024px (the realistic image-row size). Direct asm: autovec emits 20 YMM uses + 6 XMM; hand-written emits 20 YMM + 67 XMM. The autovec path uses 256-bit registers; the hand-written code can't because it was written with 128-bit intrinsics throughout. Changes: - Replace 12 f32 dispatcher impls with thin `#[arcane(<tier>)]` wrappers around the inline scalar loop. LLVM autovec lifts to 256-bit YMM under target_feature avx2,fma. - Delete `mod x86_f32_chunks`, `mod arm_f32_chunks`, `mod wasm_f32_chunks` (~1180 lines). 36 hand-written chunk SIMD primitives removed before they ever published. - Delete the 3 `pub use *_f32_chunks::{...}` re-export blocks. - Delete tests/asm_inline_check.rs (the chunks it inline-checked are gone). - Delete 36 in-file per-arch parity tests in `mod tests`. - Strip 2 chunk-related bench groups from benches/deinterleave.rs; the chunk-size-choice and autovec-vs-chunk groups served their purpose (proved this refactor was right) and are now dead. What stays: - 12 pure-scalar f32 chunk primitives (`*_chunk{4,8,16}_to_planes_scalar` and `planes_to_*_chunk{4,8,16}_scalar`). Useful for callers inside their own `#[arcane(<tier>)]` region — autovec lifts them to YMM. - 4 slice-level f32 dispatchers (the public API). - 2 u8/u16 tokenless chunk fns (`rgb24_chunk8_to_planes_tokenless_v3`, `rgb48_chunk8_to_planes_tokenless_v3`). These DO use 256-bit AVX2 (`_mm256_cvtepu8_epi32` widening + `_mm256_storeu_ps`). Caller: zenanalyze. Caller migration: zenpixels-convert/fast_gamut_v2.rs (only caller of PR #5's f32 chunks): rename `*_chunk{4,8}_to_planes_tokenless_{v3,neon,wasm128}` → `*_chunk{4,8}_to_planes_scalar`. Caller is already inside `#[arcane(<tier>)]` so the scalar body autovecs to 256-bit YMM. Verified: - x86_64 native: 203 tests pass (was 216 — diff is 36 deleted parity tests + 12 lost from asm_inline_check, vs 35 net counted) - aarch64 cross + qemu: 201 tests pass - wasm32-wasip1 + wasmtime + simd128: 201 tests pass - clippy clean - zenpixels-convert with garb 0.2.8 (path-patched): 259 lib tests pass File-level diff: src/deinterleave.rs -1180 lines (chunk modules) +12 dispatcher trims tests/asm_inline_check.rs deleted (-277 lines) benches/deinterleave.rs -451 lines (chunk-size + autovec-vs-chunk groups) zenpixels-convert/src/fast_gamut_v2.rs 12 call sites renamed Tracking: #7

The previous section described PR #5's never-shipped chunk-level `*_chunk8_to_planes_v3(_t: X64V3Token, ...)` API and called the f32 chunks "hand-tuned AVX2". After bench-driven trimming both claims are wrong. Replace with the actual v0.2.8 surface: - u8 / u16 paths: hand-written `_mm_shuffle_epi8` + 256-bit AVX2 widening, 21-52% faster than autovec (kept). - f32 paths: `#[autoversion]` over scalar, autovec to 256-bit YMM / NEON / wasm128 (drop the 128-bit hand-written chunks; they lost to autovec by 26-37% at 1024px). - Chunk-level public surface: u8 / u16 `_tokenless_v3` for SIMD fusion, plus `_scalar` chunks for both u8/u16 and f32 callers inside their own #[arcane(<tier>)] region. Cross-references the saved bench files for anyone wanting to see the data behind the design choices.

…-break v0.2.8 (#8) * feat(deinterleave): add chunk-level f32 RGB/RGBA deinterleave + interleave Mirrors the existing rgb24_chunk8_to_planes_scalar pattern at f32 input and across {4, 8, 16}-pixel chunk widths × {RGB, RGBA} × {deinterleave, interleave} = 12 new public functions. Each body is a single fixed-array literal expression; LLVM auto-vectorizes the contiguous loads into vinsertps/vshufps (x86) or tbl (AArch64) shuffles per the "Fixed-array scalar loads/stores can auto-vectorize into shuffles" pattern. No SIMD intrinsics, no unsafe code. Use case: enables fused per-chunk TRC + 3×3 matrix kernels in zenpixels-convert without manual deinterleave/interleave loops. * feat(deinterleave): real SIMD f32 RGB/RGBA via NEON ld3q + AVX2/WASM shuffles The 8 #[arcane] f32 RGB/RGBA slice-level dispatchers in PR #5 were scalar stubs — the v3 / neon variants called the same loop body as scalar, with only autovectorize as the SIMD path. Replace with real hand-rolled per-arch chunk SIMD: - aarch64 NEON: vld3q_f32 / vld4q_f32 / vst3q_f32 / vst4q_f32 hardware structure-load instructions, one per 4-pixel chunk. Confirmed ld3 { v0.4s, v1.4s, v2.4s } / st3 / st4 in cargo asm output for the slice-level rgb_f32_to_planes_f32 and friends. - x86_64 AVX2: 128-bit shuffles using vshufps / vinsertps / vblendps / vbroadcastss / vpermilps for chunk-4 RGB; _MM_TRANSPOSE4_PS via vunpcklps / vunpckhps / vmovlhps / vmovhlps for chunk-4 RGBA. Larger chunk sizes (8, 16) compose chunk-4 calls for clean AVX2 codegen. cargo asm shows 56+ vshufps / vblendps / vinsertps / vbroadcastss in the inlined chunk-16 path with zero scalar f32 ops. - wasm32 SIMD128: i8x16.shuffle (the wasm i32x4_shuffle macro lowers to i8x16.shuffle) for cross-lane 5-shuffle deinterleave. cargo asm shows 42 i8x16.shuffle / 33 v128.load / 24 v128.store in the slice-level body. Each chunk-level SIMD function is #[rite] so it inlines into the caller's #[arcane] / target_feature region with zero call/return overhead — verified by tests/asm_inline_check.rs which asserts the sample_caller_*_chunk16 body contains the chunk function's SIMD ops inline (no call instruction). Public API additions (purely additive — cargo semver-checks reports no semver update required): rgb_f32_chunk{4,8,16}_to_planes_{scalar,v3,neon,wasm128} rgba_f32_chunk{4,8,16}_to_planes_{scalar,v3,neon,wasm128} planes_to_rgb_f32_chunk{4,8,16}_{scalar,v3,neon,wasm128} planes_to_rgba_f32_chunk{4,8,16}_{scalar,v3,neon,wasm128} PR #5's no-suffix names are kept as thin wrappers around *_scalar so existing callers compile untouched. Slice-level rgb_f32_to_planes_f32 / rgba_f32_to_planes_f32 / planes_f32_to_rgb_f32 / planes_f32_to_rgba_f32 now route through the chunk-level SIMD via a 16 -> 8 -> 4 -> scalar tail loop in the impl_v3 / impl_neon / impl_wasm128 #[arcane] entries. Tests: 12 byte-exact parity tests per arch (x86_64 V3, aarch64 NEON, wasm32 SIMD128) plus an asm-inline-check harness that runs as a runtime smoke test under each target. All 209 wasm32-wasip1 lib tests pass under wasmtime, all 209 aarch64-unknown-linux-gnu lib tests pass under qemu (cross), all 228 x86_64 lib tests pass natively. * wip(deinterleave): partial rename of NEON chunk fns to _tokenless_<tier> Found in `~/work/garb--cold-pr5` worktree on `feat/tokenless-rite-deinterleave` branch as uncommitted edits when picking up to draft v0.2.8. The diff renames 7 NEON chunk fns (chunk4 × 4 + chunk8 × 3) from `_neon(_t: NeonToken, ...)` form to `#[rite(neon)] ..._tokenless_neon(...)` form — exactly the convention the v0.2.8 tracking issue (#7) prescribes for moving archmage Token types out of garb's public signatures. Preserving as-is per CLAUDE.md "NEVER DESTROY UNCOMMITTED WORK". Subsequent commits on this branch finish the rename across the remaining v3 / wasm128 / chunk16 / NEON-chunk8-RGBA-interleave fns plus the old u8/u16 _v3 chunks. Note: this partial state does NOT compile by itself — the internal callers in `arm_f32::*_impl_neon` still call the old names. Subsequent commits fix that as part of finishing the rename. * refactor(deinterleave): drop 12 bare-name f32 chunk aliases PR #5 shipped one-line forwarders rgb_f32_chunk{4,8,16}_to_planes etc. that delegated to the matching _scalar variant. They had zero callers anywhere in the zen workspace (verified via rg) and the naming misled readers into thinking they were runtime dispatchers when they were just scalar thunks. Drop them. Tests that called the bare names now call _scalar directly. The section's leading comment block updated to document the new convention (_scalar + _tokenless_<tier>) and explain why the bare _v3/_neon/_wasm128 suffix is reserved for the token-accepting form. Tracking: #7 * feat(deinterleave): drop archmage Token types from public API (BREAKING) Replace 36 token-taking f32 chunk fns + 2 u8/u16 token-taking chunks with tokenless `#[rite(<tier>)]` equivalents named `_tokenless_<tier>`. Garb's public surface is now archmage-token-free. Public API change in `garb::deinterleave::`: REMOVED: rgb24_chunk8_to_planes_v3(_t: X64V3Token, ...) rgb48_chunk8_to_planes_v3(_t: X64V3Token, ...) REPLACED WITH: rgb24_chunk8_to_planes_tokenless_v3(...) #[rite(v3)] rgb48_chunk8_to_planes_tokenless_v3(...) #[rite(v3)] ADDED (PR #5 surface, no archmage types): {rgb,rgba}_f32_chunk{4,8,16}_to_planes_tokenless_{v3,neon,wasm128} planes_to_{rgb,rgba}_f32_chunk{4,8,16}_tokenless_{v3,neon,wasm128} {rgb,rgba}_f32_chunk{4,8,16}_to_planes_scalar planes_to_{rgb,rgba}_f32_chunk{4,8,16}_scalar {rgb,rgba}_f32_to_planes_f32 / planes_f32_to_{rgb,rgba}_f32 (slice-level) Why tokenless: archmage's tier-based `#[rite(<tier>)]` form adds `#[target_feature]` + `#[inline]` + `#[cfg(target_arch)]` automatically. Functions decorated this way are safe to call from any matching `#[arcane]` or `#[rite]` region (Rust 1.86+ relaxation), inline cleanly into the caller's SIMD body, and have no archmage type in their signature. This decouples garb's semver from archmage's. Today's downstream callers (zenanalyze, zenpixels-convert/fast-gamut-refactor, zenpipe--garb-deinterleave/ zenfilters) all already directly depend on archmage themselves, so the removal of tokens from garb's signatures imposes zero new import burden — they just stop passing the token they already have. Why "_tokenless_" infix: the bare `_v3` / `_neon` / `_wasm128` suffix is reserved (by the archmage convention used elsewhere in the workspace) for token-accepting forms. The new infix avoids ambiguity. The archmage Token types are still used internally by `*_impl_<tier>` dispatcher fns that take a token from `incant!`, but those stay private. Bench impact: zero. The `#[rite(<tier>)]` swap and parameter rename are zero-LOC-delta in lowered code; same SIMD body, same inlining behavior. Test impact: 36 in-file parity tests (`*_v3_matches_scalar` etc.) and the tests/asm_inline_check.rs harness updated to use tokenless names. All 216 tests pass on x86_64; aarch64 + wasm32 verified via cross/wasmtime in subsequent commits. Tracking: #7 * fix(deinterleave): rename u8/u16 chunk pub-use aliases to _tokenless_v3 Pass-3 of the rename script renamed the source side of the re-export (`x86::rgb24_chunk8_tokenless_v3`) but missed the alias side, leaving the external name as `rgb24_chunk8_to_planes_v3` — same as the v0.2.7 token-taking name, so external callers wouldn't find the new tokenless fn at the documented path. Renamed both u8 and u16 aliases to `*_tokenless_v3` for consistency with the f32 surface. Caught by zenanalyze tier1.rs migration test build: error[E0425]: cannot find function `rgb24_chunk8_to_planes_tokenless_v3` in module `garb::deinterleave` All 216 garb tests still pass after the rename. * bench(deinterleave): chunk-size A/B for slice dispatcher pipeline Adds a `rgb_f32 chunk size choice` group to `benches/deinterleave.rs` comparing 4 dispatcher variants — chunk16+tail, chunk8+tail, chunk4+tail, and the current chunk16 → chunk8 → chunk4 → scalar cascade — across sizes whose tails do and don't divide 16 cleanly. Findings (7950X, AVX2): | size | chunk16 | chunk8 | chunk4 | cascade | | 16px (clean) | 21.4ns | 23.6ns | 21.3ns | 20.7ns | | 17px (+1 tail) | 24.1ns | 22.7ns | 22.2ns | 22.3ns | | 31px (+15 tail) | 31.2ns | 28.3ns | 26.2ns | 25.9ns | | 1024px (clean L1) | 184.2ns | 178.5ns | 214.6ns | 197.9ns | | 4099px (+3 tail) | 678.2ns | 679.9ns | 845.3ns | 738.7ns | | 65536px (L2) | 13.02µs | 13.16µs | 13.72µs | 12.89µs | At realistic image-row sizes (1024-4099 px), chunk16+tail ties or beats the cascade. Cascade only wins at sub-32px awkward-tail sizes that don't matter in production. chunk4-only loses consistently at 1024+ px. Action item for the v0.2.8 PR: drop chunk8/chunk4 from the slice dispatcher pipeline — chunk16 + flat scalar tail across the 12 impl bodies. Saves ~140 LOC, no perf regression at sizes that matter. This bench also informally answers the by-value-vs-mut-ref output question: cargo asm on the current dispatcher shows 14 direct `vmovups` writes to output slices and 1 stack write (a register spill, not a chunk-fn local). LLVM has already elided the chunk fn's tuple-return through inlining; the mut-ref refactor would change zero asm. Tracking: #7 * refactor(deinterleave): chunk16 + flat scalar tail in slice dispatchers The 16→8→4→scalar cascade in the 12 f32 slice-dispatcher impl bodies didn't earn its keep over a simpler chunk16 + scalar-tail pipeline. From the chunk-size A/B benchmark on this branch (7950X, AVX2): | size | chunk16+tail | cascade 16-8-4 | delta | | 16px | 21.4ns | 20.7ns | -3% | | 17px (+1 tail) | 24.1ns | 22.3ns | -8% | | 31px (+15 tail) | 31.2ns | 25.9ns | -17% | | 1024px (clean) | 184.2ns | 197.9ns | +7% | | 4099px (+3 tail) | 678.2ns | 738.7ns | +9% | | 65536px (L2) | 13.02µs | 12.89µs | -1% | The cascade only wins at sub-32px sizes that aren't representative of production image-row work (where rows are hundreds to thousands of pixels). At 1024-4099px (the realistic range), chunk16+tail is 7-9% faster — its single-loop structure reduces branch overhead and makes the hot path easier for LLVM to schedule. Each `*_impl_<tier>` body simplifies from a 4-stage cascade (chunk16, chunk8, chunk4, scalar) to 1 chunk16 loop + 1 flat scalar tail (up to 15 pixels). The chunk8 / chunk4 public fns stay — `zenanalyze/tier1.rs` calls `rgb24_chunk8_to_planes_v3` (now `_tokenless_v3`) and `zenpixels-convert/fast_gamut_v2.rs` calls chunk8 v3 + chunk4 neon/wasm128 directly under their own `#[arcane]` regions. They aren't part of the slice-dispatcher path. Touches all 12 dispatcher impls — 4 shapes × 3 tiers (v3/neon/wasm128). Net diff: 12 fns × ~50 lines → ~25 lines each, cuts roughly 300 LOC. Test plan: - x86_64 native: 216 tests pass - aarch64 cross + qemu: 214 tests pass - wasm32-wasip1 + wasmtime + simd128: 214 tests pass Tracking: #7 * bench(deinterleave): autovec(avx2) vs hand-written chunk SIMD Adds a `rgb_f32 autovec vs chunk SIMD` group to benches/deinterleave.rs comparing the current chunk-SIMD slice dispatcher path (`rgb_f32_to_planes_f32` → `*_chunk*_to_planes_tokenless_v3` → 128-bit `_mm_*` intrinsics) vs LLVM autovec on the same scalar source wrapped in `#[arcane(v3)]` (which lets LLVM use 256-bit YMM ops under target_feature avx2,fma). Findings (7950X, AVX2): | size | scatter chunk-SIMD | scatter autovec | autovec wins | | 64px | 29.7ns | 31.0ns | -4% | | 256px | 63.3ns | 48.8ns | +30% | | 1024px | 178.2ns | 130.5ns | +37% | | 4096px | 719.1ns | 703.9ns | +2% | | 16384px | 2691.3ns | 2857.3ns | -6% | | 65536px | 12.69µs | 13.03µs | -3% | | size | gather chunk-SIMD | gather autovec | autovec wins | | 64px | 29.7ns | 27.4ns | +8% | | 1024px | 195.8ns | 155.2ns | +26% | | 262144px | 51.6µs | 47.8µs | +8% | At 1024px (the realistic image-row size), autovec wins by 26-37%. At larger sizes (16K+) they're within ~5%, memory-bandwidth-dominated. At smaller sizes the picture is mixed but autovec mostly wins. Root cause: `mod x86_f32_chunks` uses 75 × `_mm_*` (128-bit XMM) ops and 0 × `_mm256_*` (256-bit YMM) ops. The "v3" naming on the f32 chunks is misleading — it's VEX-encoded SSE-style code, not real 256-bit AVX2. Direct asm-level confirmation: path YMM uses XMM uses -------------------------------------------------------------------------- autovec (scalar under #[arcane(v3)]) 20 6 chunk-SIMD (__arcane_rgb_f32_to_planes_impl_v3) 20 67 LLVM autovec emits 3.3× higher YMM-to-XMM ratio than the hand-written chunks. The hand-written code can't use 256-bit registers because it was written with 128-bit intrinsics throughout. Implication: the hand-written f32 chunk SIMD ships a net-negative implementation vs autovec at the size range that matters most for image processing (1024px scatter: autovec is 37% faster). Either: A. Drop hand-written f32 chunk SIMD from the slice dispatcher path; route through #[arcane(v3)] + scalar source (autovec). B. Rewrite chunk SIMD using actual 256-bit `_mm256_*` ops (`_mm256_loadu_ps`, `_mm256_permutevar8x32_ps`, ...). Separate engineering project. Public chunk fns (chunk4/chunk8/chunk16) stay regardless for callers that want to fuse them into their own SIMD loops (`zenanalyze/tier1.rs` and `zenpixels-convert/fast_gamut_v2.rs`), but they shouldn't be the path the slice dispatcher takes. Tracking: #7 * refactor(deinterleave): drop hand-written f32 chunk SIMD; route through autovec Per benchmarks/deinterleave_autovec_vs_chunk_2026-05-07: the 36 hand-written f32 chunk SIMD primitives in PR #5 used 128-bit XMM ops exclusively (75 × _mm_* vs 0 × _mm256_* in the f32 chunk module). The "_v3" naming was misleading — VEX-encoded SSE-style code, not 256-bit AVX2. LLVM autovec on the scalar loop body wrapped in #[arcane(v3)] beats the hand-written 128-bit chunks by 26-37% at 1024px (the realistic image-row size). Direct asm: autovec emits 20 YMM uses + 6 XMM; hand-written emits 20 YMM + 67 XMM. The autovec path uses 256-bit registers; the hand-written code can't because it was written with 128-bit intrinsics throughout. Changes: - Replace 12 f32 dispatcher impls with thin `#[arcane(<tier>)]` wrappers around the inline scalar loop. LLVM autovec lifts to 256-bit YMM under target_feature avx2,fma. - Delete `mod x86_f32_chunks`, `mod arm_f32_chunks`, `mod wasm_f32_chunks` (~1180 lines). 36 hand-written chunk SIMD primitives removed before they ever published. - Delete the 3 `pub use *_f32_chunks::{...}` re-export blocks. - Delete tests/asm_inline_check.rs (the chunks it inline-checked are gone). - Delete 36 in-file per-arch parity tests in `mod tests`. - Strip 2 chunk-related bench groups from benches/deinterleave.rs; the chunk-size-choice and autovec-vs-chunk groups served their purpose (proved this refactor was right) and are now dead. What stays: - 12 pure-scalar f32 chunk primitives (`*_chunk{4,8,16}_to_planes_scalar` and `planes_to_*_chunk{4,8,16}_scalar`). Useful for callers inside their own `#[arcane(<tier>)]` region — autovec lifts them to YMM. - 4 slice-level f32 dispatchers (the public API). - 2 u8/u16 tokenless chunk fns (`rgb24_chunk8_to_planes_tokenless_v3`, `rgb48_chunk8_to_planes_tokenless_v3`). These DO use 256-bit AVX2 (`_mm256_cvtepu8_epi32` widening + `_mm256_storeu_ps`). Caller: zenanalyze. Caller migration: zenpixels-convert/fast_gamut_v2.rs (only caller of PR #5's f32 chunks): rename `*_chunk{4,8}_to_planes_tokenless_{v3,neon,wasm128}` → `*_chunk{4,8}_to_planes_scalar`. Caller is already inside `#[arcane(<tier>)]` so the scalar body autovecs to 256-bit YMM. Verified: - x86_64 native: 203 tests pass (was 216 — diff is 36 deleted parity tests + 12 lost from asm_inline_check, vs 35 net counted) - aarch64 cross + qemu: 201 tests pass - wasm32-wasip1 + wasmtime + simd128: 201 tests pass - clippy clean - zenpixels-convert with garb 0.2.8 (path-patched): 259 lib tests pass File-level diff: src/deinterleave.rs -1180 lines (chunk modules) +12 dispatcher trims tests/asm_inline_check.rs deleted (-277 lines) benches/deinterleave.rs -451 lines (chunk-size + autovec-vs-chunk groups) zenpixels-convert/src/fast_gamut_v2.rs 12 call sites renamed Tracking: #7 * refactor(deinterleave): use #[autoversion] for f32 dispatchers Replace the manual `incant!` + `mod x86_f32 / arm_f32 / wasm_f32` + `pub(crate) fn ..._impl_<tier>` boilerplate with a single `#[autoversion(v3, neon, wasm128)]` attribute on each f32 public dispatcher. Before: 4 public fns + 4 `_impl_scalar` + 12 per-arch `_impl_<tier>` wrappers across 3 mods + 3 `use` blocks + 4 `incant!` calls = ~240 lines of dispatcher boilerplate. After: 4 public fns each decorated with `#[autoversion(v3, neon, wasm128)]`, body is the error checks + a direct call to the inline scalar loop helper. archmage's autoversion macro generates the per-tier variants + runtime dispatcher. The autoversion-generated code is asm-identical to the previous manual wiring on x86_64 (20 YMM uses + 6 XMM in `__arcane_*_v3`, same as before). Slice dispatchers still autovec to 256-bit YMM under target_feature avx2,fma — the simplification is purely cosmetic / DRY. The u8/u16 path (`rgb24_to_planes_f32`, `rgb48_to_planes_f32`) keeps its manual `incant!` + chunk-SIMD wiring because the hand-written `_mm_shuffle_epi8` chunk fns beat autovec by 21-52% at L1-L3 sizes (see `benchmarks/{rgb24,rgb48}_chunk_vs_autovec_2026-05-07`). Different shape from f32 — `_mm_shuffle_epi8` patterns autovec can't infer + real 256-bit AVX2 widening. Saved benches: benchmarks/rgb24_chunk_vs_autovec_2026-05-07.{log,meta} — chunk SIMD +21% to +52% at all non-memory-bound sizes benchmarks/rgb48_chunk_vs_autovec_2026-05-07.{log,meta} — chunk SIMD +22% to +49% src/deinterleave.rs: 1838 → 1587 lines (-251). Verified: - x86_64: 203 tests pass - aarch64: 201 tests pass - wasm32-wasip1: 201 tests pass - clippy clean Tracking: #7 * fix(deinterleave): drop redundant + mis-cfg-gated archmage imports The previous commit's import cleanup got mangled by `cargo fmt` reordering: the `#[cfg(target_arch = "x86_64")]` attribute on `use archmage::X64V3Token` ended up also gating `use archmage::autoversion`, even though autoversion is a proc-macro that needs to be available on all archs (it cfg-gates its generated tier variants internally). Tests still passed because `archmage::prelude::*` re-exports both `autoversion` and `incant` — the explicit `use` statements were redundant. Drop them; rely on `prelude::*`. Keep only the gated `X64V3Token` import (used by the existing `autovec_avx2_rgb24` u8 helper in the file body). Net delete: 2 lines. * docs(README): update deinterleave section for v0.2.8 surface The previous section described PR #5's never-shipped chunk-level `*_chunk8_to_planes_v3(_t: X64V3Token, ...)` API and called the f32 chunks "hand-tuned AVX2". After bench-driven trimming both claims are wrong. Replace with the actual v0.2.8 surface: - u8 / u16 paths: hand-written `_mm_shuffle_epi8` + 256-bit AVX2 widening, 21-52% faster than autovec (kept). - f32 paths: `#[autoversion]` over scalar, autovec to 256-bit YMM / NEON / wasm128 (drop the 128-bit hand-written chunks; they lost to autovec by 26-37% at 1024px). - Chunk-level public surface: u8 / u16 `_tokenless_v3` for SIMD fusion, plus `_scalar` chunks for both u8/u16 and f32 callers inside their own #[arcane(<tier>)] region. Cross-references the saved bench files for anyone wanting to see the data behind the design choices.

The previous entry talked extensively about PR #5's never-published hand-written f32 chunk SIMD ("Not added (PR #5's...)") and the incant!→autoversion internal refactor as if either were user-visible API changes. They aren't — users only see deltas relative to released versions. Tightened to: - Removed (BREAKING): the 2 token-taking _v3 chunk fns - Added: 2 tokenless _v3 replacements + 12 scalar f32 chunk primitives - Changed (internal): incant!→autoversion in the existing slice dispatchers, no public API change - Migration: 1-line caller rename Bench evidence and PR-#5-supersession history live in benchmarks/*.meta and the GitHub issue/PR thread; not in CHANGELOG.

lilith self-assigned this May 2, 2026

lilith changed the title ~~feat(deinterleave): add chunk-level f32 RGB/RGBA deinterleave + interleave~~ feat(deinterleave): real SIMD f32 RGB/RGBA chunks (NEON ld3q + AVX2 / wasm128 shuffles) May 3, 2026

lilith force-pushed the deinterleave-f32-chunks branch from 8e650b4 to 9f359a8 Compare May 3, 2026 00:20

This was referenced May 8, 2026

deinterleave: drop archmage types from public API; supersede #5 with tokenless #[rite] chunks #7

Closed

deinterleave: tokenless #[rite(<tier>)] chunks; supersedes #5; semver-break v0.2.8 #8

Merged

lilith closed this May 8, 2026

lilith deleted the deinterleave-f32-chunks branch May 9, 2026 23:21

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(deinterleave): real SIMD f32 RGB/RGBA chunks (NEON ld3q + AVX2 / wasm128 shuffles)#5

feat(deinterleave): real SIMD f32 RGB/RGBA chunks (NEON ld3q + AVX2 / wasm128 shuffles)#5
lilith wants to merge 2 commits into
mainfrom
deinterleave-f32-chunks

lilith commented May 2, 2026 •

edited

Loading

Uh oh!

lilith commented May 8, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

lilith commented May 2, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Public API

Codegen verification

Test plan

Notes

Uh oh!

lilith commented May 8, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

lilith commented May 2, 2026 •

edited

Loading