prov/shm: add fast paths for tagged/msg send/inject to fix perf regressions #12205
Closed
yinliaws wants to merge 1 commit into
The SHM provider regressed vs v2.4.x for small-to-medium tagged/msg
transfers after the command queue refactor. Profiling identified four
root causes:
1. Generic dispatch overhead: main routes ALL sends through
   smr_generic_sendmsg -> smr_send_ops[proto]() function pointer dispatch,
   with heap allocation for smr_pend_entry, IOV copies, and pending-entry
   completion tracking on every send.
2. Inline data cap too small: SMR_MSG_DATA_LEN=192 forces 256B messages
   into the inject path, losing inline's single-cacheline fast-path.
3. Store-to-load forwarding stall: the sequence
       mov    %r11, 0x8(%rcx)    ; 8-byte store
       movdqu 0x8(%rcx), %xmm1   ; 16-byte load of same location
   cannot be forwarded (size mismatch) and consumed 82% of smr_tinject
   time on Intel, confirmed via `perf annotate`.
4. Cross-process cacheline reads on receive: main reads header fields
   from pcmd (sender's region), ~50ns per field on Intel mesh.
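
For illustration, the store-to-load forwarding pattern in item 3 can be sketched in C. This is a minimal sketch, not the provider code: `struct demo_hdr`, `publish_in_place()` and `publish_from_stack()` are hypothetical names, and whether a given compiler emits the narrow-store/wide-load sequence depends on the compiler and flags. The first function writes a field into memory and then copies the whole struct, which may compile to a wide load over the bytes just stored. The second builds the header in a local first, so both destinations are written from a value that never has to be reloaded from memory.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical 24-byte header; NOT the real smr_cmd_hdr layout. */
struct demo_hdr {
	uint64_t entry;
	uint64_t tag;
	uint64_t size;
};

/*
 * Stall-prone shape: narrow stores into *local, followed by a struct copy
 * that the compiler may implement with wider (e.g. 16-byte) loads covering
 * the bytes that were just stored.
 */
static void publish_in_place(struct demo_hdr *local, struct demo_hdr *shared,
			     uint64_t entry, uint64_t tag, uint64_t size)
{
	local->entry = entry;
	local->tag = tag;	/* 8-byte store */
	local->size = size;
	*shared = *local;	/* may reload local->tag with a wider load */
}

/*
 * Stack-first shape: assemble the header in a local value, then store it
 * to both destinations. No load depends on a preceding narrower store.
 */
static void publish_from_stack(struct demo_hdr *local, struct demo_hdr *shared,
			       uint64_t entry, uint64_t tag, uint64_t size)
{
	struct demo_hdr h = { .entry = entry, .tag = tag, .size = size };

	*local = h;
	*shared = h;
}
```

Both functions leave identical bytes in `*local` and `*shared`; only the order of loads and stores differs.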
Fixes:
- smr_fast_tinject(): bypass generic dispatch for tagged inject when the
  payload fits inline, writing directly to ce->cmd without pend allocation.
- smr_fast_inject_v24(): dedicated path for payloads from 321B up to
  SMR_INJECT_SIZE. Builds the header on the stack first, then stores to
  pcmd (local) and ce->cmd (shared). This avoids the load-after-store
  pattern. Uses a dedicated SMR_FAST_INJECT_TX_CTX marker so the return
  queue handler skips the pend dereference. Gated by a new
  SMR_FLAG_FAST_INJECT_V2 peer flag for backward compatibility during
  rolling upgrades.
- smr_fast_tsend(): analogous fast path for fi_send/fi_tsend with
  completion, covering both inline (<=SMR_MSG_DATA_LEN) and inject
  sub-paths.
- Inject fast receive path in smr_progress_cmd: dispatches from
  ce->cmd.hdr (local L1) instead of pcmd (cross-process). A _hdr
  abstraction reads the remaining fields from the appropriate location
  (ce->cmd for fast inject, pcmd for slow path). A small prefetch
  (<=2 cachelines) primes the HW prefetcher for the data copy.
- SMR_CMD_SIZE grown from 360 to 488, increasing SMR_MSG_DATA_LEN from
  192 to 320. 256B messages now use the inline path. This is the single
  largest win (+127% to +383% BW at 256B across platforms). SMR_VERSION
  bumped to 11 to reflect the new queue entry layout.
- tx_ctx cleared in all slow-path branches where cmd = &ce->cmd (inline
  in generic_sendmsg/generic_inject/RMA/atomic) to prevent stale
  SMR_FAST_INJECT_TX_CTX markers, left by prior fast-inject usage of the
  same queue slot, from misdirecting the receiver's dispatch.
- Unexpected-message handling (ENOENT + freestack empty): the original
  code silently released the queue entry and dropped the message. Fixed
  to break out of the progress loop so the next poll retries with a
  fresh freestack, avoiding message loss under heavy windowed BW.
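
The effect of the larger inline cap on path selection can be sketched as follows. This is a hedged illustration only: `demo_select_path()`, `enum demo_path` and `DEMO_INJECT_SIZE` are hypothetical, and the value 4096 is an assumption standing in for SMR_INJECT_SIZE, not a figure taken from this patch. The 360/192 and 488/320 values are the ones quoted above; the static assertion checks that the non-payload part of a command (168 bytes) is the same before and after the change.

```c
#include <assert.h>
#include <stddef.h>

/* Sizes quoted in the description above. */
enum {
	OLD_CMD_SIZE = 360, OLD_MSG_DATA_LEN = 192,
	NEW_CMD_SIZE = 488, NEW_MSG_DATA_LEN = 320
};

/* ASSUMPTION: placeholder for SMR_INJECT_SIZE, chosen for illustration. */
enum { DEMO_INJECT_SIZE = 4096 };

/* The command grew by exactly the extra inline payload (128 bytes). */
static_assert(OLD_CMD_SIZE - OLD_MSG_DATA_LEN ==
	      NEW_CMD_SIZE - NEW_MSG_DATA_LEN,
	      "non-payload part of the command is unchanged");

/* Hypothetical path labels; the real provider uses its own protocol enum. */
enum demo_path { DEMO_PATH_INLINE, DEMO_PATH_INJECT, DEMO_PATH_GENERIC };

/* Pick a path from the payload length and the inline data cap. */
static enum demo_path demo_select_path(size_t len, size_t inline_cap)
{
	if (len <= inline_cap)
		return DEMO_PATH_INLINE;	/* payload travels in the command */
	if (len <= DEMO_INJECT_SIZE)
		return DEMO_PATH_INJECT;	/* payload goes via an inject buffer */
	return DEMO_PATH_GENERIC;		/* everything larger */
}
```

With the old cap, `demo_select_path(256, OLD_MSG_DATA_LEN)` yields the inject path; with the new cap, `demo_select_path(256, NEW_MSG_DATA_LEN)` yields the inline path, and 321B is the first size that falls through to inject.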
Performance (medians across 5 latency / 3 BW rounds, 20K iterations,
window=2000 for BW). Each cell shows the percent change as
PATCHED vs v2.4.x / PATCHED vs main.

Latency (usec; negative is better):
      AMD AL2        AMD Ubuntu     Graviton c7gn  Intel P5EN     Intel P4D
256B  -11.7%/-22.7%  -7.0%/-19.4%   -15.9%/-25.8%  -24.5%/-40.0%  -29.9%/-38.2%
512B  -7.3%/-17.1%   -7.3%/-17.9%   +3.3%/ -4.2%   +4.9%/-15.7%   -9.9%/-14.3%
1K    -11.7%/-19.7%  -8.5%/-13.4%   +2.1%/ -6.2%   +8.8%/-13.7%   -14.3%/-16.2%
2K    +3.9%/ -8.2%   -5.7%/-12.5%   -9.8%/-14.2%   +12.6%/-10.9%  -10.8%/-13.9%
4K    +1.7%/ -9.1%   +0.8%/ -9.7%   -8.1%/-11.8%   +6.6%/-13.8%   -1.8%/ -6.0%

Bandwidth (MB/sec; positive is better):
      AMD AL2        AMD Ubuntu     Graviton c7gn  Intel P5EN     Intel P4D
256B  +151%/+183%    +149%/+194%    +127%/+102%    +383%/+183%    +213%/+127%
1K    +49%/ +19%     +52%/ +14%     +109%/ +48%    +178%/ +33%    +122%/ +20%
2K    +19%/ +11%     +24%/ +3%      +84%/ +36%     +132%/ +34%    +74%/ +6%
4K    +36%/ +4%      +20%/ +2%      +59%/ +30%     +108%/ +29%    +46%/ +2%

Correctness: 35/35 RDM fabtests pass on AMD (AL2) and Intel (P5EN). No
hangs under BW stress (20K iter x window=2000 x all sizes).
Remaining regressions (e.g. AMD 512B BW in multi-size sweeps, Intel
P5EN 2K-4K latency) are architectural to main's sender-owned inject
buffer model and would require reverting to v2.4.x-style peer-owned
inject pools with a lock-free SPSC freestack to fix fully. PATCHED is
better than main at nearly all sizes on all platforms, so the
regressions are relative to v2.4.x's pre-refactor baseline, not to the
current shipping branch.
Signed-off-by: Yin Li <yinliq@amazon.com>