feat(deinterleave): real SIMD f32 RGB/RGBA chunks (NEON ld3q + AVX2 / wasm128 shuffles)#5
Closed
lilith wants to merge 2 commits into
Closed
feat(deinterleave): real SIMD f32 RGB/RGBA chunks (NEON ld3q + AVX2 / wasm128 shuffles)#5lilith wants to merge 2 commits into
lilith wants to merge 2 commits into
Conversation
…leave
Mirrors the existing rgb24_chunk8_to_planes_scalar pattern at f32 input
and across {4, 8, 16}-pixel chunk widths × {RGB, RGBA} × {deinterleave,
interleave} = 12 new public functions.
Each body is a single fixed-array literal expression; LLVM auto-vectorizes
the contiguous loads into vinsertps/vshufps (x86) or tbl (AArch64)
shuffles per the "Fixed-array scalar loads/stores can auto-vectorize into
shuffles" pattern. No SIMD intrinsics, no unsafe code.
Use case: enables fused per-chunk TRC + 3×3 matrix kernels in
zenpixels-convert without manual deinterleave/interleave loops.
lilith
added a commit
that referenced
this pull request
May 3, 2026
…shuffles The 8 #[arcane] f32 RGB/RGBA slice-level dispatchers in PR #5 were scalar stubs — the v3 / neon variants called the same loop body as scalar, with only autovectorize as the SIMD path. Replace with real hand-rolled per-arch chunk SIMD: - aarch64 NEON: vld3q_f32 / vld4q_f32 / vst3q_f32 / vst4q_f32 hardware structure-load instructions, one per 4-pixel chunk. Confirmed ld3 { v0.4s, v1.4s, v2.4s } / st3 / st4 in cargo asm output for the slice-level rgb_f32_to_planes_f32 and friends. - x86_64 AVX2: 128-bit shuffles using vshufps / vinsertps / vblendps / vbroadcastss / vpermilps for chunk-4 RGB; _MM_TRANSPOSE4_PS via vunpcklps / vunpckhps / vmovlhps / vmovhlps for chunk-4 RGBA. Larger chunk sizes (8, 16) compose chunk-4 calls for clean AVX2 codegen. cargo asm shows 56+ vshufps / vblendps / vinsertps / vbroadcastss in the inlined chunk-16 path with zero scalar f32 ops. - wasm32 SIMD128: i8x16.shuffle (the wasm i32x4_shuffle macro lowers to i8x16.shuffle) for cross-lane 5-shuffle deinterleave. cargo asm shows 42 i8x16.shuffle / 33 v128.load / 24 v128.store in the slice-level body. Each chunk-level SIMD function is #[rite] so it inlines into the caller's #[arcane] / target_feature region with zero call/return overhead — verified by tests/asm_inline_check.rs which asserts the sample_caller_*_chunk16 body contains the chunk function's SIMD ops inline (no call instruction). Public API additions (purely additive — cargo semver-checks reports no semver update required): rgb_f32_chunk{4,8,16}_to_planes_{scalar,v3,neon,wasm128} rgba_f32_chunk{4,8,16}_to_planes_{scalar,v3,neon,wasm128} planes_to_rgb_f32_chunk{4,8,16}_{scalar,v3,neon,wasm128} planes_to_rgba_f32_chunk{4,8,16}_{scalar,v3,neon,wasm128} PR #5's no-suffix names are kept as thin wrappers around *_scalar so existing callers compile untouched. Slice-level rgb_f32_to_planes_f32 / rgba_f32_to_planes_f32 / planes_f32_to_rgb_f32 / planes_f32_to_rgba_f32 now route through the chunk-level SIMD via a 16 -> 8 -> 4 -> scalar tail loop in the impl_v3 / impl_neon / impl_wasm128 #[arcane] entries. Tests: 12 byte-exact parity tests per arch (x86_64 V3, aarch64 NEON, wasm32 SIMD128) plus an asm-inline-check harness that runs as a runtime smoke test under each target. All 209 wasm32-wasip1 lib tests pass under wasmtime, all 209 aarch64-unknown-linux-gnu lib tests pass under qemu (cross), all 228 x86_64 lib tests pass natively.
…shuffles The 8 #[arcane] f32 RGB/RGBA slice-level dispatchers in PR #5 were scalar stubs — the v3 / neon variants called the same loop body as scalar, with only autovectorize as the SIMD path. Replace with real hand-rolled per-arch chunk SIMD: - aarch64 NEON: vld3q_f32 / vld4q_f32 / vst3q_f32 / vst4q_f32 hardware structure-load instructions, one per 4-pixel chunk. Confirmed ld3 { v0.4s, v1.4s, v2.4s } / st3 / st4 in cargo asm output for the slice-level rgb_f32_to_planes_f32 and friends. - x86_64 AVX2: 128-bit shuffles using vshufps / vinsertps / vblendps / vbroadcastss / vpermilps for chunk-4 RGB; _MM_TRANSPOSE4_PS via vunpcklps / vunpckhps / vmovlhps / vmovhlps for chunk-4 RGBA. Larger chunk sizes (8, 16) compose chunk-4 calls for clean AVX2 codegen. cargo asm shows 56+ vshufps / vblendps / vinsertps / vbroadcastss in the inlined chunk-16 path with zero scalar f32 ops. - wasm32 SIMD128: i8x16.shuffle (the wasm i32x4_shuffle macro lowers to i8x16.shuffle) for cross-lane 5-shuffle deinterleave. cargo asm shows 42 i8x16.shuffle / 33 v128.load / 24 v128.store in the slice-level body. Each chunk-level SIMD function is #[rite] so it inlines into the caller's #[arcane] / target_feature region with zero call/return overhead — verified by tests/asm_inline_check.rs which asserts the sample_caller_*_chunk16 body contains the chunk function's SIMD ops inline (no call instruction). Public API additions (purely additive — cargo semver-checks reports no semver update required): rgb_f32_chunk{4,8,16}_to_planes_{scalar,v3,neon,wasm128} rgba_f32_chunk{4,8,16}_to_planes_{scalar,v3,neon,wasm128} planes_to_rgb_f32_chunk{4,8,16}_{scalar,v3,neon,wasm128} planes_to_rgba_f32_chunk{4,8,16}_{scalar,v3,neon,wasm128} PR #5's no-suffix names are kept as thin wrappers around *_scalar so existing callers compile untouched. Slice-level rgb_f32_to_planes_f32 / rgba_f32_to_planes_f32 / planes_f32_to_rgb_f32 / planes_f32_to_rgba_f32 now route through the chunk-level SIMD via a 16 -> 8 -> 4 -> scalar tail loop in the impl_v3 / impl_neon / impl_wasm128 #[arcane] entries. Tests: 12 byte-exact parity tests per arch (x86_64 V3, aarch64 NEON, wasm32 SIMD128) plus an asm-inline-check harness that runs as a runtime smoke test under each target. All 209 wasm32-wasip1 lib tests pass under wasmtime, all 209 aarch64-unknown-linux-gnu lib tests pass under qemu (cross), all 228 x86_64 lib tests pass natively.
8e650b4 to
9f359a8
Compare
This was referenced May 8, 2026
Closed
Member
Author
lilith
added a commit
that referenced
this pull request
May 8, 2026
…gh autovec Per benchmarks/deinterleave_autovec_vs_chunk_2026-05-07: the 36 hand-written f32 chunk SIMD primitives in PR #5 used 128-bit XMM ops exclusively (75 × _mm_* vs 0 × _mm256_* in the f32 chunk module). The "_v3" naming was misleading — VEX-encoded SSE-style code, not 256-bit AVX2. LLVM autovec on the scalar loop body wrapped in #[arcane(v3)] beats the hand-written 128-bit chunks by 26-37% at 1024px (the realistic image-row size). Direct asm: autovec emits 20 YMM uses + 6 XMM; hand-written emits 20 YMM + 67 XMM. The autovec path uses 256-bit registers; the hand-written code can't because it was written with 128-bit intrinsics throughout. Changes: - Replace 12 f32 dispatcher impls with thin `#[arcane(<tier>)]` wrappers around the inline scalar loop. LLVM autovec lifts to 256-bit YMM under target_feature avx2,fma. - Delete `mod x86_f32_chunks`, `mod arm_f32_chunks`, `mod wasm_f32_chunks` (~1180 lines). 36 hand-written chunk SIMD primitives removed before they ever published. - Delete the 3 `pub use *_f32_chunks::{...}` re-export blocks. - Delete tests/asm_inline_check.rs (the chunks it inline-checked are gone). - Delete 36 in-file per-arch parity tests in `mod tests`. - Strip 2 chunk-related bench groups from benches/deinterleave.rs; the chunk-size-choice and autovec-vs-chunk groups served their purpose (proved this refactor was right) and are now dead. What stays: - 12 pure-scalar f32 chunk primitives (`*_chunk{4,8,16}_to_planes_scalar` and `planes_to_*_chunk{4,8,16}_scalar`). Useful for callers inside their own `#[arcane(<tier>)]` region — autovec lifts them to YMM. - 4 slice-level f32 dispatchers (the public API). - 2 u8/u16 tokenless chunk fns (`rgb24_chunk8_to_planes_tokenless_v3`, `rgb48_chunk8_to_planes_tokenless_v3`). These DO use 256-bit AVX2 (`_mm256_cvtepu8_epi32` widening + `_mm256_storeu_ps`). Caller: zenanalyze. Caller migration: zenpixels-convert/fast_gamut_v2.rs (only caller of PR #5's f32 chunks): rename `*_chunk{4,8}_to_planes_tokenless_{v3,neon,wasm128}` → `*_chunk{4,8}_to_planes_scalar`. Caller is already inside `#[arcane(<tier>)]` so the scalar body autovecs to 256-bit YMM. Verified: - x86_64 native: 203 tests pass (was 216 — diff is 36 deleted parity tests + 12 lost from asm_inline_check, vs 35 net counted) - aarch64 cross + qemu: 201 tests pass - wasm32-wasip1 + wasmtime + simd128: 201 tests pass - clippy clean - zenpixels-convert with garb 0.2.8 (path-patched): 259 lib tests pass File-level diff: src/deinterleave.rs -1180 lines (chunk modules) +12 dispatcher trims tests/asm_inline_check.rs deleted (-277 lines) benches/deinterleave.rs -451 lines (chunk-size + autovec-vs-chunk groups) zenpixels-convert/src/fast_gamut_v2.rs 12 call sites renamed Tracking: #7
lilith
added a commit
that referenced
this pull request
May 8, 2026
The previous section described PR #5's never-shipped chunk-level `*_chunk8_to_planes_v3(_t: X64V3Token, ...)` API and called the f32 chunks "hand-tuned AVX2". After bench-driven trimming both claims are wrong. Replace with the actual v0.2.8 surface: - u8 / u16 paths: hand-written `_mm_shuffle_epi8` + 256-bit AVX2 widening, 21-52% faster than autovec (kept). - f32 paths: `#[autoversion]` over scalar, autovec to 256-bit YMM / NEON / wasm128 (drop the 128-bit hand-written chunks; they lost to autovec by 26-37% at 1024px). - Chunk-level public surface: u8 / u16 `_tokenless_v3` for SIMD fusion, plus `_scalar` chunks for both u8/u16 and f32 callers inside their own #[arcane(<tier>)] region. Cross-references the saved bench files for anyone wanting to see the data behind the design choices.
lilith
added a commit
that referenced
this pull request
May 8, 2026
…-break v0.2.8 (#8) * feat(deinterleave): add chunk-level f32 RGB/RGBA deinterleave + interleave Mirrors the existing rgb24_chunk8_to_planes_scalar pattern at f32 input and across {4, 8, 16}-pixel chunk widths × {RGB, RGBA} × {deinterleave, interleave} = 12 new public functions. Each body is a single fixed-array literal expression; LLVM auto-vectorizes the contiguous loads into vinsertps/vshufps (x86) or tbl (AArch64) shuffles per the "Fixed-array scalar loads/stores can auto-vectorize into shuffles" pattern. No SIMD intrinsics, no unsafe code. Use case: enables fused per-chunk TRC + 3×3 matrix kernels in zenpixels-convert without manual deinterleave/interleave loops. * feat(deinterleave): real SIMD f32 RGB/RGBA via NEON ld3q + AVX2/WASM shuffles The 8 #[arcane] f32 RGB/RGBA slice-level dispatchers in PR #5 were scalar stubs — the v3 / neon variants called the same loop body as scalar, with only autovectorize as the SIMD path. Replace with real hand-rolled per-arch chunk SIMD: - aarch64 NEON: vld3q_f32 / vld4q_f32 / vst3q_f32 / vst4q_f32 hardware structure-load instructions, one per 4-pixel chunk. Confirmed ld3 { v0.4s, v1.4s, v2.4s } / st3 / st4 in cargo asm output for the slice-level rgb_f32_to_planes_f32 and friends. - x86_64 AVX2: 128-bit shuffles using vshufps / vinsertps / vblendps / vbroadcastss / vpermilps for chunk-4 RGB; _MM_TRANSPOSE4_PS via vunpcklps / vunpckhps / vmovlhps / vmovhlps for chunk-4 RGBA. Larger chunk sizes (8, 16) compose chunk-4 calls for clean AVX2 codegen. cargo asm shows 56+ vshufps / vblendps / vinsertps / vbroadcastss in the inlined chunk-16 path with zero scalar f32 ops. - wasm32 SIMD128: i8x16.shuffle (the wasm i32x4_shuffle macro lowers to i8x16.shuffle) for cross-lane 5-shuffle deinterleave. cargo asm shows 42 i8x16.shuffle / 33 v128.load / 24 v128.store in the slice-level body. Each chunk-level SIMD function is #[rite] so it inlines into the caller's #[arcane] / target_feature region with zero call/return overhead — verified by tests/asm_inline_check.rs which asserts the sample_caller_*_chunk16 body contains the chunk function's SIMD ops inline (no call instruction). Public API additions (purely additive — cargo semver-checks reports no semver update required): rgb_f32_chunk{4,8,16}_to_planes_{scalar,v3,neon,wasm128} rgba_f32_chunk{4,8,16}_to_planes_{scalar,v3,neon,wasm128} planes_to_rgb_f32_chunk{4,8,16}_{scalar,v3,neon,wasm128} planes_to_rgba_f32_chunk{4,8,16}_{scalar,v3,neon,wasm128} PR #5's no-suffix names are kept as thin wrappers around *_scalar so existing callers compile untouched. Slice-level rgb_f32_to_planes_f32 / rgba_f32_to_planes_f32 / planes_f32_to_rgb_f32 / planes_f32_to_rgba_f32 now route through the chunk-level SIMD via a 16 -> 8 -> 4 -> scalar tail loop in the impl_v3 / impl_neon / impl_wasm128 #[arcane] entries. Tests: 12 byte-exact parity tests per arch (x86_64 V3, aarch64 NEON, wasm32 SIMD128) plus an asm-inline-check harness that runs as a runtime smoke test under each target. All 209 wasm32-wasip1 lib tests pass under wasmtime, all 209 aarch64-unknown-linux-gnu lib tests pass under qemu (cross), all 228 x86_64 lib tests pass natively. * wip(deinterleave): partial rename of NEON chunk fns to _tokenless_<tier> Found in `~/work/garb--cold-pr5` worktree on `feat/tokenless-rite-deinterleave` branch as uncommitted edits when picking up to draft v0.2.8. The diff renames 7 NEON chunk fns (chunk4 × 4 + chunk8 × 3) from `_neon(_t: NeonToken, ...)` form to `#[rite(neon)] ..._tokenless_neon(...)` form — exactly the convention the v0.2.8 tracking issue (#7) prescribes for moving archmage Token types out of garb's public signatures. Preserving as-is per CLAUDE.md "NEVER DESTROY UNCOMMITTED WORK". Subsequent commits on this branch finish the rename across the remaining v3 / wasm128 / chunk16 / NEON-chunk8-RGBA-interleave fns plus the old u8/u16 _v3 chunks. Note: this partial state does NOT compile by itself — the internal callers in `arm_f32::*_impl_neon` still call the old names. Subsequent commits fix that as part of finishing the rename. * refactor(deinterleave): drop 12 bare-name f32 chunk aliases PR #5 shipped one-line forwarders rgb_f32_chunk{4,8,16}_to_planes etc. that delegated to the matching _scalar variant. They had zero callers anywhere in the zen workspace (verified via rg) and the naming misled readers into thinking they were runtime dispatchers when they were just scalar thunks. Drop them. Tests that called the bare names now call _scalar directly. The section's leading comment block updated to document the new convention (_scalar + _tokenless_<tier>) and explain why the bare _v3/_neon/_wasm128 suffix is reserved for the token-accepting form. Tracking: #7 * feat(deinterleave): drop archmage Token types from public API (BREAKING) Replace 36 token-taking f32 chunk fns + 2 u8/u16 token-taking chunks with tokenless `#[rite(<tier>)]` equivalents named `_tokenless_<tier>`. Garb's public surface is now archmage-token-free. Public API change in `garb::deinterleave::`: REMOVED: rgb24_chunk8_to_planes_v3(_t: X64V3Token, ...) rgb48_chunk8_to_planes_v3(_t: X64V3Token, ...) REPLACED WITH: rgb24_chunk8_to_planes_tokenless_v3(...) #[rite(v3)] rgb48_chunk8_to_planes_tokenless_v3(...) #[rite(v3)] ADDED (PR #5 surface, no archmage types): {rgb,rgba}_f32_chunk{4,8,16}_to_planes_tokenless_{v3,neon,wasm128} planes_to_{rgb,rgba}_f32_chunk{4,8,16}_tokenless_{v3,neon,wasm128} {rgb,rgba}_f32_chunk{4,8,16}_to_planes_scalar planes_to_{rgb,rgba}_f32_chunk{4,8,16}_scalar {rgb,rgba}_f32_to_planes_f32 / planes_f32_to_{rgb,rgba}_f32 (slice-level) Why tokenless: archmage's tier-based `#[rite(<tier>)]` form adds `#[target_feature]` + `#[inline]` + `#[cfg(target_arch)]` automatically. Functions decorated this way are safe to call from any matching `#[arcane]` or `#[rite]` region (Rust 1.86+ relaxation), inline cleanly into the caller's SIMD body, and have no archmage type in their signature. This decouples garb's semver from archmage's. Today's downstream callers (zenanalyze, zenpixels-convert/fast-gamut-refactor, zenpipe--garb-deinterleave/ zenfilters) all already directly depend on archmage themselves, so the removal of tokens from garb's signatures imposes zero new import burden — they just stop passing the token they already have. Why "_tokenless_" infix: the bare `_v3` / `_neon` / `_wasm128` suffix is reserved (by the archmage convention used elsewhere in the workspace) for token-accepting forms. The new infix avoids ambiguity. The archmage Token types are still used internally by `*_impl_<tier>` dispatcher fns that take a token from `incant!`, but those stay private. Bench impact: zero. The `#[rite(<tier>)]` swap and parameter rename are zero-LOC-delta in lowered code; same SIMD body, same inlining behavior. Test impact: 36 in-file parity tests (`*_v3_matches_scalar` etc.) and the tests/asm_inline_check.rs harness updated to use tokenless names. All 216 tests pass on x86_64; aarch64 + wasm32 verified via cross/wasmtime in subsequent commits. Tracking: #7 * fix(deinterleave): rename u8/u16 chunk pub-use aliases to _tokenless_v3 Pass-3 of the rename script renamed the source side of the re-export (`x86::rgb24_chunk8_tokenless_v3`) but missed the alias side, leaving the external name as `rgb24_chunk8_to_planes_v3` — same as the v0.2.7 token-taking name, so external callers wouldn't find the new tokenless fn at the documented path. Renamed both u8 and u16 aliases to `*_tokenless_v3` for consistency with the f32 surface. Caught by zenanalyze tier1.rs migration test build: error[E0425]: cannot find function `rgb24_chunk8_to_planes_tokenless_v3` in module `garb::deinterleave` All 216 garb tests still pass after the rename. * bench(deinterleave): chunk-size A/B for slice dispatcher pipeline Adds a `rgb_f32 chunk size choice` group to `benches/deinterleave.rs` comparing 4 dispatcher variants — chunk16+tail, chunk8+tail, chunk4+tail, and the current chunk16 → chunk8 → chunk4 → scalar cascade — across sizes whose tails do and don't divide 16 cleanly. Findings (7950X, AVX2): | size | chunk16 | chunk8 | chunk4 | cascade | | 16px (clean) | 21.4ns | 23.6ns | 21.3ns | 20.7ns | | 17px (+1 tail) | 24.1ns | 22.7ns | 22.2ns | 22.3ns | | 31px (+15 tail) | 31.2ns | 28.3ns | 26.2ns | 25.9ns | | 1024px (clean L1) | 184.2ns | 178.5ns | 214.6ns | 197.9ns | | 4099px (+3 tail) | 678.2ns | 679.9ns | 845.3ns | 738.7ns | | 65536px (L2) | 13.02µs | 13.16µs | 13.72µs | 12.89µs | At realistic image-row sizes (1024-4099 px), chunk16+tail ties or beats the cascade. Cascade only wins at sub-32px awkward-tail sizes that don't matter in production. chunk4-only loses consistently at 1024+ px. Action item for the v0.2.8 PR: drop chunk8/chunk4 from the slice dispatcher pipeline — chunk16 + flat scalar tail across the 12 impl bodies. Saves ~140 LOC, no perf regression at sizes that matter. This bench also informally answers the by-value-vs-mut-ref output question: cargo asm on the current dispatcher shows 14 direct `vmovups` writes to output slices and 1 stack write (a register spill, not a chunk-fn local). LLVM has already elided the chunk fn's tuple-return through inlining; the mut-ref refactor would change zero asm. Tracking: #7 * refactor(deinterleave): chunk16 + flat scalar tail in slice dispatchers The 16→8→4→scalar cascade in the 12 f32 slice-dispatcher impl bodies didn't earn its keep over a simpler chunk16 + scalar-tail pipeline. From the chunk-size A/B benchmark on this branch (7950X, AVX2): | size | chunk16+tail | cascade 16-8-4 | delta | | 16px | 21.4ns | 20.7ns | -3% | | 17px (+1 tail) | 24.1ns | 22.3ns | -8% | | 31px (+15 tail) | 31.2ns | 25.9ns | -17% | | 1024px (clean) | 184.2ns | 197.9ns | +7% | | 4099px (+3 tail) | 678.2ns | 738.7ns | +9% | | 65536px (L2) | 13.02µs | 12.89µs | -1% | The cascade only wins at sub-32px sizes that aren't representative of production image-row work (where rows are hundreds to thousands of pixels). At 1024-4099px (the realistic range), chunk16+tail is 7-9% faster — its single-loop structure reduces branch overhead and makes the hot path easier for LLVM to schedule. Each `*_impl_<tier>` body simplifies from a 4-stage cascade (chunk16, chunk8, chunk4, scalar) to 1 chunk16 loop + 1 flat scalar tail (up to 15 pixels). The chunk8 / chunk4 public fns stay — `zenanalyze/tier1.rs` calls `rgb24_chunk8_to_planes_v3` (now `_tokenless_v3`) and `zenpixels-convert/fast_gamut_v2.rs` calls chunk8 v3 + chunk4 neon/wasm128 directly under their own `#[arcane]` regions. They aren't part of the slice-dispatcher path. Touches all 12 dispatcher impls — 4 shapes × 3 tiers (v3/neon/wasm128). Net diff: 12 fns × ~50 lines → ~25 lines each, cuts roughly 300 LOC. Test plan: - x86_64 native: 216 tests pass - aarch64 cross + qemu: 214 tests pass - wasm32-wasip1 + wasmtime + simd128: 214 tests pass Tracking: #7 * bench(deinterleave): autovec(avx2) vs hand-written chunk SIMD Adds a `rgb_f32 autovec vs chunk SIMD` group to benches/deinterleave.rs comparing the current chunk-SIMD slice dispatcher path (`rgb_f32_to_planes_f32` → `*_chunk*_to_planes_tokenless_v3` → 128-bit `_mm_*` intrinsics) vs LLVM autovec on the same scalar source wrapped in `#[arcane(v3)]` (which lets LLVM use 256-bit YMM ops under target_feature avx2,fma). Findings (7950X, AVX2): | size | scatter chunk-SIMD | scatter autovec | autovec wins | | 64px | 29.7ns | 31.0ns | -4% | | 256px | 63.3ns | 48.8ns | +30% | | 1024px | 178.2ns | 130.5ns | +37% | | 4096px | 719.1ns | 703.9ns | +2% | | 16384px | 2691.3ns | 2857.3ns | -6% | | 65536px | 12.69µs | 13.03µs | -3% | | size | gather chunk-SIMD | gather autovec | autovec wins | | 64px | 29.7ns | 27.4ns | +8% | | 1024px | 195.8ns | 155.2ns | +26% | | 262144px | 51.6µs | 47.8µs | +8% | At 1024px (the realistic image-row size), autovec wins by 26-37%. At larger sizes (16K+) they're within ~5%, memory-bandwidth-dominated. At smaller sizes the picture is mixed but autovec mostly wins. Root cause: `mod x86_f32_chunks` uses 75 × `_mm_*` (128-bit XMM) ops and 0 × `_mm256_*` (256-bit YMM) ops. The "v3" naming on the f32 chunks is misleading — it's VEX-encoded SSE-style code, not real 256-bit AVX2. Direct asm-level confirmation: path YMM uses XMM uses -------------------------------------------------------------------------- autovec (scalar under #[arcane(v3)]) 20 6 chunk-SIMD (__arcane_rgb_f32_to_planes_impl_v3) 20 67 LLVM autovec emits 3.3× higher YMM-to-XMM ratio than the hand-written chunks. The hand-written code can't use 256-bit registers because it was written with 128-bit intrinsics throughout. Implication: the hand-written f32 chunk SIMD ships a net-negative implementation vs autovec at the size range that matters most for image processing (1024px scatter: autovec is 37% faster). Either: A. Drop hand-written f32 chunk SIMD from the slice dispatcher path; route through #[arcane(v3)] + scalar source (autovec). B. Rewrite chunk SIMD using actual 256-bit `_mm256_*` ops (`_mm256_loadu_ps`, `_mm256_permutevar8x32_ps`, ...). Separate engineering project. Public chunk fns (chunk4/chunk8/chunk16) stay regardless for callers that want to fuse them into their own SIMD loops (`zenanalyze/tier1.rs` and `zenpixels-convert/fast_gamut_v2.rs`), but they shouldn't be the path the slice dispatcher takes. Tracking: #7 * refactor(deinterleave): drop hand-written f32 chunk SIMD; route through autovec Per benchmarks/deinterleave_autovec_vs_chunk_2026-05-07: the 36 hand-written f32 chunk SIMD primitives in PR #5 used 128-bit XMM ops exclusively (75 × _mm_* vs 0 × _mm256_* in the f32 chunk module). The "_v3" naming was misleading — VEX-encoded SSE-style code, not 256-bit AVX2. LLVM autovec on the scalar loop body wrapped in #[arcane(v3)] beats the hand-written 128-bit chunks by 26-37% at 1024px (the realistic image-row size). Direct asm: autovec emits 20 YMM uses + 6 XMM; hand-written emits 20 YMM + 67 XMM. The autovec path uses 256-bit registers; the hand-written code can't because it was written with 128-bit intrinsics throughout. Changes: - Replace 12 f32 dispatcher impls with thin `#[arcane(<tier>)]` wrappers around the inline scalar loop. LLVM autovec lifts to 256-bit YMM under target_feature avx2,fma. - Delete `mod x86_f32_chunks`, `mod arm_f32_chunks`, `mod wasm_f32_chunks` (~1180 lines). 36 hand-written chunk SIMD primitives removed before they ever published. - Delete the 3 `pub use *_f32_chunks::{...}` re-export blocks. - Delete tests/asm_inline_check.rs (the chunks it inline-checked are gone). - Delete 36 in-file per-arch parity tests in `mod tests`. - Strip 2 chunk-related bench groups from benches/deinterleave.rs; the chunk-size-choice and autovec-vs-chunk groups served their purpose (proved this refactor was right) and are now dead. What stays: - 12 pure-scalar f32 chunk primitives (`*_chunk{4,8,16}_to_planes_scalar` and `planes_to_*_chunk{4,8,16}_scalar`). Useful for callers inside their own `#[arcane(<tier>)]` region — autovec lifts them to YMM. - 4 slice-level f32 dispatchers (the public API). - 2 u8/u16 tokenless chunk fns (`rgb24_chunk8_to_planes_tokenless_v3`, `rgb48_chunk8_to_planes_tokenless_v3`). These DO use 256-bit AVX2 (`_mm256_cvtepu8_epi32` widening + `_mm256_storeu_ps`). Caller: zenanalyze. Caller migration: zenpixels-convert/fast_gamut_v2.rs (only caller of PR #5's f32 chunks): rename `*_chunk{4,8}_to_planes_tokenless_{v3,neon,wasm128}` → `*_chunk{4,8}_to_planes_scalar`. Caller is already inside `#[arcane(<tier>)]` so the scalar body autovecs to 256-bit YMM. Verified: - x86_64 native: 203 tests pass (was 216 — diff is 36 deleted parity tests + 12 lost from asm_inline_check, vs 35 net counted) - aarch64 cross + qemu: 201 tests pass - wasm32-wasip1 + wasmtime + simd128: 201 tests pass - clippy clean - zenpixels-convert with garb 0.2.8 (path-patched): 259 lib tests pass File-level diff: src/deinterleave.rs -1180 lines (chunk modules) +12 dispatcher trims tests/asm_inline_check.rs deleted (-277 lines) benches/deinterleave.rs -451 lines (chunk-size + autovec-vs-chunk groups) zenpixels-convert/src/fast_gamut_v2.rs 12 call sites renamed Tracking: #7 * refactor(deinterleave): use #[autoversion] for f32 dispatchers Replace the manual `incant!` + `mod x86_f32 / arm_f32 / wasm_f32` + `pub(crate) fn ..._impl_<tier>` boilerplate with a single `#[autoversion(v3, neon, wasm128)]` attribute on each f32 public dispatcher. Before: 4 public fns + 4 `_impl_scalar` + 12 per-arch `_impl_<tier>` wrappers across 3 mods + 3 `use` blocks + 4 `incant!` calls = ~240 lines of dispatcher boilerplate. After: 4 public fns each decorated with `#[autoversion(v3, neon, wasm128)]`, body is the error checks + a direct call to the inline scalar loop helper. archmage's autoversion macro generates the per-tier variants + runtime dispatcher. The autoversion-generated code is asm-identical to the previous manual wiring on x86_64 (20 YMM uses + 6 XMM in `__arcane_*_v3`, same as before). Slice dispatchers still autovec to 256-bit YMM under target_feature avx2,fma — the simplification is purely cosmetic / DRY. The u8/u16 path (`rgb24_to_planes_f32`, `rgb48_to_planes_f32`) keeps its manual `incant!` + chunk-SIMD wiring because the hand-written `_mm_shuffle_epi8` chunk fns beat autovec by 21-52% at L1-L3 sizes (see `benchmarks/{rgb24,rgb48}_chunk_vs_autovec_2026-05-07`). Different shape from f32 — `_mm_shuffle_epi8` patterns autovec can't infer + real 256-bit AVX2 widening. Saved benches: benchmarks/rgb24_chunk_vs_autovec_2026-05-07.{log,meta} — chunk SIMD +21% to +52% at all non-memory-bound sizes benchmarks/rgb48_chunk_vs_autovec_2026-05-07.{log,meta} — chunk SIMD +22% to +49% src/deinterleave.rs: 1838 → 1587 lines (-251). Verified: - x86_64: 203 tests pass - aarch64: 201 tests pass - wasm32-wasip1: 201 tests pass - clippy clean Tracking: #7 * fix(deinterleave): drop redundant + mis-cfg-gated archmage imports The previous commit's import cleanup got mangled by `cargo fmt` reordering: the `#[cfg(target_arch = "x86_64")]` attribute on `use archmage::X64V3Token` ended up also gating `use archmage::autoversion`, even though autoversion is a proc-macro that needs to be available on all archs (it cfg-gates its generated tier variants internally). Tests still passed because `archmage::prelude::*` re-exports both `autoversion` and `incant` — the explicit `use` statements were redundant. Drop them; rely on `prelude::*`. Keep only the gated `X64V3Token` import (used by the existing `autovec_avx2_rgb24` u8 helper in the file body). Net delete: 2 lines. * docs(README): update deinterleave section for v0.2.8 surface The previous section described PR #5's never-shipped chunk-level `*_chunk8_to_planes_v3(_t: X64V3Token, ...)` API and called the f32 chunks "hand-tuned AVX2". After bench-driven trimming both claims are wrong. Replace with the actual v0.2.8 surface: - u8 / u16 paths: hand-written `_mm_shuffle_epi8` + 256-bit AVX2 widening, 21-52% faster than autovec (kept). - f32 paths: `#[autoversion]` over scalar, autovec to 256-bit YMM / NEON / wasm128 (drop the 128-bit hand-written chunks; they lost to autovec by 26-37% at 1024px). - Chunk-level public surface: u8 / u16 `_tokenless_v3` for SIMD fusion, plus `_scalar` chunks for both u8/u16 and f32 callers inside their own #[arcane(<tier>)] region. Cross-references the saved bench files for anyone wanting to see the data behind the design choices.
lilith
added a commit
that referenced
this pull request
May 8, 2026
The previous entry talked extensively about PR #5's never-published hand-written f32 chunk SIMD ("Not added (PR #5's...)") and the incant!→autoversion internal refactor as if either were user-visible API changes. They aren't — users only see deltas relative to released versions. Tightened to: - Removed (BREAKING): the 2 token-taking _v3 chunk fns - Added: 2 tokenless _v3 replacements + 12 scalar f32 chunk primitives - Changed (internal): incant!→autoversion in the existing slice dispatchers, no public API change - Migration: 1-line caller rename Bench evidence and PR-#5-supersession history live in benchmarks/*.meta and the GitHub issue/PR thread; not in CHANGELOG.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
This PR (re-titled and expanded) lands two layers of work on
src/deinterleave.rs:vld3q_f32/vld4q_f32/vst3q_f32/vst4q_f32on AArch64; realvshufps/vinsertps/vblendps/vunpcklps/vmovlhpson x86_64 AVX2; reali8x16.shuffle/v128.load/v128.storeon wasm32 SIMD128. The slice-levelrgb_f32_to_planes_f32/rgba_f32_to_planes_f32/planes_f32_to_rgb_f32/planes_f32_to_rgba_f32#[arcane]dispatchers (which previously called the scalar loop body verbatim under atarget_feature = "avx2"/"neon"region) now route through these chunk SIMD primitives via a 16 → 8 → 4 → scalar tail pipeline.Public API
Twelve new shape-keys per chunk variant — chunk-{4, 8, 16} × {RGB, RGBA} × {deinterleave, interleave}:
The bare-name aliases from the original PR #5 (
rgb_f32_chunk4_to_planesetc.) are preserved as thin wrappers around*_scalarso existing callers compile untouched.cargo semver-checks check-release --all-featuresreportsSummary no semver update required— purely additive.Codegen verification
#[rite]on every per-arch chunk function so they fuse into the caller's#[arcane]/target_featureregion — nocall/binstruction at the chunk boundary. Verified bytests/asm_inline_check.rswhich dumps asample_caller_*_chunk16body and asserts the chunk function's SIMD ops appear inline.x86_64 AVX2 — `sample_caller_v3_rgb_chunk16` (chunk-16 RGB deinterleave, fully inlined)
vmovups/vshufps/vblendps/vinsertps/vbroadcastss).callinstructions to any*_chunk*_v3symbol — fully inlined.vmovss/movssloads.aarch64 NEON — `garb::deinterleave::rgb_f32_to_planes_f32` (slice-level dispatch into chunk-16 NEON)
ld3 { v0.4s, v1.4s, v2.4s }per inner-loop iteration (= unrolled 2 × chunk-16) — exactly the hardware structure-load.ldr s0loads in the inner loop.ld4 { v0.4s, v1.4s, v2.4s, v3.4s }instead — 8 in the chunk-16 unrolled body.st3/st4symmetrically (8 each in the unrolled chunk-16 inner body).wasm32 SIMD128 — `garb::deinterleave::rgb_f32_to_planes_f32` (slice-level dispatch into chunk-16 wasm128)
WAT histogram across the full slice-level body (built with
RUSTFLAGS="-C target-feature=+simd128"):Test plan
cargo test --release --features experimental(x86_64) — 228 lib tests pass, including 12 byte-exact*_v3 vs *_scalarparity tests for the new SIMD chunks.cross test --release --features experimental --target aarch64-unknown-linux-gnu(qemu) — 209 lib tests pass, including 12 byte-exact*_neon vs *_scalarparity tests.RUSTFLAGS="-C target-feature=+simd128" cargo test --target wasm32-wasip1 --features experimental --release(wasmtime) — 209 lib tests pass, including 12 byte-exact*_wasm128 vs *_scalarparity tests.tests/asm_inline_check.rs— runs as a smoke test on each arch when the host has the required ISA; verifies the#[rite]chunk functions inline cleanly into a sample#[arcane]caller. Passes on x86_64 native + aarch64 cross.cargo semver-checks check-release --all-features—Summary no semver update required(purely additive).cargo clippy --release --all-features --lib --tests— clean.Existing tests retained (round-trip / slice-API-equivalence / order-preservation) — they now exercise the SIMD path on each arch instead of bare scalar.
Notes
Cargo.toml; release cadence is the maintainer's call.*_impl_wasm128route is wired up viaincant!(...[v3, neon, wasm128, scalar])for symmetry; existing slice-level callers benefit on wasm32 with+simd128._scalarvariants stay public so callers in non-SIMD regions (or with their own dispatch) can use them without going through the slice-level dispatcher.zenpixels-convert's narrow body (the original use case) is a separate follow-up.