prov/shm: add fast paths for tagged/msg send/inject to fix perf regressions by yinliaws · Pull Request #12205 · ofiwg/libfabric

yinliaws · 2026-04-30T21:06:13Z

The SHM provider regressed vs v2.4.x for small-to-medium tagged/msg transfers after the command queue refactor. Profiling identified four root causes:

1. Generic dispatch overhead: main routes ALL sends through smr_generic_sendmsg -> smr_send_ops[proto]() function pointer dispatch, with heap allocation for smr_pend_entry, IOV copies, and pending-entry completion tracking on every send.
2. Inline data cap too small: SMR_MSG_DATA_LEN=192 forces 256B messages into the inject path, losing inline's single-cacheline fast-path.
3. Store-to-load forwarding stall: the sequence
       mov %r11, 0x8(%rcx)        ; 8-byte store
       movdqu 0x8(%rcx), %xmm1    ; 16-byte load of same location
   cannot be forwarded (size mismatch) and consumed 82% of smr_tinject time on Intel, confirmed via `perf annotate`.
4. Cross-process cacheline reads on receive: main reads header fields from pcmd (sender's region), ~50ns per field on Intel mesh.
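
Root cause 2 comes down to a size threshold. The following is a minimal sketch of size-based path selection, not the provider's code: `pick_path`, `enum example_path`, and `EXAMPLE_INJECT_SIZE` are hypothetical names, and 4096 is a placeholder because this description does not state the value of SMR_INJECT_SIZE. Only the 192 cap and the 320 cap this PR moves to come from the description.

```c
#include <assert.h>
#include <stddef.h>

/* Inline caps quoted in this description. */
#define OLD_MSG_DATA_LEN 192   /* SMR_MSG_DATA_LEN on main */
#define NEW_MSG_DATA_LEN 320   /* SMR_MSG_DATA_LEN with this PR */

/* Assumption: placeholder standing in for SMR_INJECT_SIZE. */
#define EXAMPLE_INJECT_SIZE 4096

/* Hypothetical enum; not the provider's protocol enum. */
enum example_path { PATH_INLINE, PATH_INJECT, PATH_OTHER };

/* Pick a transfer path purely from payload length and the inline cap. */
static enum example_path pick_path(size_t len, size_t inline_cap)
{
    assert(inline_cap <= EXAMPLE_INJECT_SIZE);

    if (len <= inline_cap)
        return PATH_INLINE;     /* payload travels inside the command */
    if (len <= EXAMPLE_INJECT_SIZE)
        return PATH_INJECT;     /* payload goes through an inject buffer */
    return PATH_OTHER;          /* larger protocols, out of scope here */
}
```

Under these assumptions, `pick_path(256, OLD_MSG_DATA_LEN)` selects the inject path, while `pick_path(256, NEW_MSG_DATA_LEN)` stays inline.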

Fixes:

- smr_fast_tinject(): bypass generic dispatch for tagged inject when the payload fits inline, writing directly to ce->cmd without pend allocation.
- smr_fast_inject_v24(): dedicated path for payloads from 321B up to SMR_INJECT_SIZE. Builds the header on the stack first, then stores it to pcmd (local) and ce->cmd (shared), which avoids the load-after-store pattern. Uses a dedicated SMR_FAST_INJECT_TX_CTX marker so the return queue handler skips the pend dereference. Gated by a new SMR_FLAG_FAST_INJECT_V2 peer flag for backward compatibility during rolling upgrades.
- smr_fast_tsend(): analogous fast path for fi_send/fi_tsend with completion, covering both inline (<=SMR_MSG_DATA_LEN) and inject sub-paths.
- Inject fast receive path in smr_progress_cmd: dispatches from ce->cmd.hdr (local L1) instead of pcmd (cross-process). A _hdr abstraction reads the remaining fields from the appropriate location (ce->cmd for fast inject, pcmd for slow path). A small prefetch (<=2 cachelines) primes the HW prefetcher for the data copy.
- SMR_CMD_SIZE grown from 360 to 488, increasing SMR_MSG_DATA_LEN from 192 to 320. 256B messages now use the inline path. This is the single largest win (+127% to +383% BW at 256B across platforms). SMR_VERSION bumped to 11 to reflect the new queue entry layout.
- tx_ctx cleared in all slow-path branches where cmd = &ce->cmd (inline in generic_sendmsg/generic_inject/RMA/atomic) to prevent stale SMR_FAST_INJECT_TX_CTX markers, left by prior fast-inject usage of the same queue slot, from misdirecting the receiver's dispatch.
- Unexpected-message handling (ENOENT + freestack empty): the original code silently released the queue entry and dropped the message. Fixed to break out of the progress loop so the next poll retries with a fresh freestack, avoiding message loss under heavy windowed BW.
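
The "build the header on the stack, then store to both destinations" idea can be sketched as follows. This is an illustration under stated assumptions, not the code in this PR: `struct example_hdr`, `publish_stall_prone`, and `publish_from_stack` are hypothetical names, and the 16-byte layout is invented for the example. Whether a compiler actually emits the narrow-store/wide-load pair for the first function depends on the optimizer; the sketch only shows the data flow that makes it possible.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical 16-byte header; field names are illustrative only. */
struct example_hdr {
    uint64_t id;
    uint64_t tag;
};

/*
 * Pattern to avoid: patch one 8-byte field in the shared copy, then read
 * the whole 16 bytes back. A 16-byte load that overlaps a just-issued
 * 8-byte store cannot be satisfied by store-to-load forwarding.
 */
static void publish_stall_prone(struct example_hdr *shared,
                                struct example_hdr *local, uint64_t tag)
{
    assert(shared != NULL && local != NULL);
    shared->tag = tag;                      /* narrow store */
    memcpy(local, shared, sizeof(*local));  /* wide load of the same bytes */
}

/*
 * Alternative: assemble the header in a stack variable and only ever
 * store to the two destinations, never reading them back.
 */
static void publish_from_stack(struct example_hdr *shared,
                               struct example_hdr *local,
                               uint64_t id, uint64_t tag)
{
    assert(shared != NULL && local != NULL);
    struct example_hdr hdr = { .id = id, .tag = tag };
    *local = hdr;    /* sender-local copy */
    *shared = hdr;   /* copy visible to the peer */
}
```

Both functions leave the two copies identical; the second simply never loads from memory it has just written.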

Performance (medians across 5 latency / 3 BW rounds, 20K iterations, window=2000 for BW):

Latency (usec), PATCHED vs v2.4.x / PATCHED vs main:

| Size | AMD AL2 | AMD Ubuntu | Graviton c7gn | Intel P5EN | Intel P4D |
|------|---------|------------|---------------|------------|-----------|
| 256B | -11.7% / -22.7% | -7.0% / -19.4% | -15.9% / -25.8% | -24.5% / -40.0% | -29.9% / -38.2% |
| 512B | -7.3% / -17.1% | -7.3% / -17.9% | +3.3% / -4.2% | +4.9% / -15.7% | -9.9% / -14.3% |
| 1K | -11.7% / -19.7% | -8.5% / -13.4% | +2.1% / -6.2% | +8.8% / -13.7% | -14.3% / -16.2% |
| 2K | +3.9% / -8.2% | -5.7% / -12.5% | -9.8% / -14.2% | +12.6% / -10.9% | -10.8% / -13.9% |
| 4K | +1.7% / -9.1% | +0.8% / -9.7% | -8.1% / -11.8% | +6.6% / -13.8% | -1.8% / -6.0% |

Bandwidth (MB/sec ratio), PATCHED vs v2.4.x / PATCHED vs main:

| Size | AMD AL2 | AMD Ubuntu | Graviton c7gn | Intel P5EN | Intel P4D |
|------|---------|------------|---------------|------------|-----------|
| 256B | +151% / +183% | +149% / +194% | +127% / +102% | +383% / +183% | +213% / +127% |
| 1K | +49% / +19% | +52% / +14% | +109% / +48% | +178% / +33% | +122% / +20% |
| 2K | +19% / +11% | +24% / +3% | +84% / +36% | +132% / +34% | +74% / +6% |
| 4K | +36% / +4% | +20% / +2% | +59% / +30% | +108% / +29% | +46% / +2% |
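
The percentages above are relative changes between per-configuration medians. As a sketch of that arithmetic (the helper names `median` and `pct_change` are hypothetical, and the exact procedure used to produce the tables is an assumption), the computation could look like this:

```c
#include <assert.h>
#include <stdlib.h>

/* Three-way comparison for qsort on doubles. */
static int cmp_double(const void *a, const void *b)
{
    double x = *(const double *)a, y = *(const double *)b;
    return (x > y) - (x < y);
}

/* Median of n samples; sorts the array in place. */
static double median(double *v, size_t n)
{
    assert(n > 0);
    qsort(v, n, sizeof(*v), cmp_double);
    return (n % 2) ? v[n / 2] : (v[n / 2 - 1] + v[n / 2]) / 2.0;
}

/* Relative change of `patched` against `baseline`, in percent. */
static double pct_change(double patched, double baseline)
{
    assert(baseline != 0.0);
    return (patched - baseline) / baseline * 100.0;
}
```

For latency a negative result is an improvement, for bandwidth a positive one; for instance `pct_change(6.0, 8.0)` is -25.0.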

Correctness: 35/35 RDM fabtests pass on AMD (AL2) and Intel (P5EN). No hangs under BW stress (20K iterations x window=2000 x all sizes).

Remaining regressions (e.g. AMD 512B BW in multi-size sweeps, Intel P5EN 2K-4K latency) are architectural to main's sender-owned inject buffer model; fixing them fully would require reverting to v2.4.x-style peer-owned inject pools with a lock-free SPSC freestack. PATCHED is strictly better than main at nearly all sizes on all platforms, so these regressions are relative to v2.4.x's pre-refactor baseline, not to the current shipping branch.

Signed-off-by: Yin Li <yinliq@amazon.com>

yinliaws force-pushed the shm-perf-fix branch from 0dc9559 to 97b2184 on May 1, 2026 17:08

yinliaws closed this May 5, 2026

yinliaws wants to merge 1 commit into ofiwg:main from yinliaws:shm-perf-fix
