prov/shm: add fast paths for tagged/msg send/inject to fix perf regressions #12205

Closed
yinliaws wants to merge 1 commit into ofiwg:main from yinliaws:shm-perf-fix
@yinliaws
Contributor

The SHM provider regressed vs v2.4.x for small-to-medium tagged/msg transfers after the command queue refactor. Profiling identified four root causes:

  1. Generic dispatch overhead: main routes ALL sends through smr_generic_sendmsg -> smr_send_ops[proto]() function pointer dispatch, with heap allocation for smr_pend_entry, IOV copies, and pending-entry completion tracking on every send.

  2. Inline data cap too small: SMR_MSG_DATA_LEN=192 forces 256B messages into the inject path, losing inline's single-cacheline fast-path.

  3. Store-to-load forwarding stall: the sequence

         mov    %r11, 0x8(%rcx)       ; 8-byte store
         movdqu 0x8(%rcx), %xmm1      ; 16-byte load of same location

     cannot be forwarded (size mismatch) and consumed 82% of smr_tinject time on Intel, confirmed via `perf annotate`.

  4. Cross-process cacheline reads on receive: main reads header fields from pcmd (sender's region), ~50ns per field on Intel mesh.
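
To make root cause 3 concrete, here is a minimal sketch of the two code shapes involved. It is not taken from the patch: `struct demo_hdr` and both `demo_*` functions are hypothetical stand-ins, and whether a given compiler actually emits a wide reload for the first shape depends on the compiler and flags. Both functions produce identical results; only the store/load pattern differs.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical header layout for illustration; not the real SHM header. */
struct demo_hdr {
	uint64_t op;     /* offset 0  */
	uint64_t tag;    /* offset 8  */
	uint64_t size;   /* offset 16 */
};

/*
 * Stall-prone shape: narrow 8-byte stores into *local are followed by a
 * copy out of the same struct. A compiler may implement the copy with
 * 16-byte vector loads that overlap a just-written field, which the
 * store buffer cannot forward.
 */
static void demo_fill_then_copy(struct demo_hdr *local, struct demo_hdr *shared,
				uint64_t op, uint64_t tag, uint64_t size)
{
	local->op = op;
	local->tag = tag;
	local->size = size;
	memcpy(shared, local, sizeof(*shared));   /* reloads from *local */
}

/*
 * Stack-first shape: build the header in a local variable, then store it
 * to both destinations. Neither destination is read back.
 */
static void demo_build_on_stack(struct demo_hdr *local, struct demo_hdr *shared,
				uint64_t op, uint64_t tag, uint64_t size)
{
	const struct demo_hdr h = { .op = op, .tag = tag, .size = size };

	*local = h;
	*shared = h;
}
```

Comparing the generated assembly of the two functions (for example with `perf annotate` or `objdump -d`) is one way to check whether the reload is present on a given toolchain.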

Fixes:

  • smr_fast_tinject(): bypass generic dispatch for tagged inject when the payload fits inline, writing directly to ce->cmd without pend allocation.

  • smr_fast_inject_v24(): dedicated path for payloads from 321B up to SMR_INJECT_SIZE. Builds the header on the stack first, then stores to pcmd (local) and ce->cmd (shared). This avoids the load-after-store pattern. Uses a dedicated SMR_FAST_INJECT_TX_CTX marker so the return queue handler skips pend dereference. Gated by a new SMR_FLAG_FAST_INJECT_V2 peer flag for backward compatibility during rolling upgrades.

  • smr_fast_tsend(): analogous fast path for fi_send/fi_tsend with completion, covering both inline (<=SMR_MSG_DATA_LEN) and inject sub-paths.

  • Inject fast receive path in smr_progress_cmd: dispatches from ce->cmd.hdr (local L1) instead of pcmd (cross-process). A _hdr abstraction reads remaining fields from the appropriate location (ce->cmd for fast inject, pcmd for slow path). A small prefetch (<=2 cachelines) primes the HW prefetcher for the data copy.

  • SMR_CMD_SIZE grown from 360 to 488, increasing SMR_MSG_DATA_LEN from 192 to 320. 256B messages now use the inline path. This is the single largest win (+127% to +383% BW at 256B vs v2.4.x across platforms). SMR_VERSION bumped to 11 to reflect the new queue entry layout.

  • tx_ctx cleared in all slow-path branches where cmd = &ce->cmd (inline in generic_sendmsg/generic_inject/RMA/atomic) to prevent stale SMR_FAST_INJECT_TX_CTX markers from prior fast-inject usage of the same queue slot misdirecting the receiver's dispatch.

  • Unexpected-message handling (ENOENT + freestack empty): the original code silently released the queue entry and dropped the message. Fixed to break out of the progress loop so the next poll retries with a fresh freestack, avoiding message loss under heavy windowed BW.
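
The size thresholds in the fixes above can be sketched as follows. The four size constants are the values quoted in this description; `DEMO_INJECT_SIZE`, `enum demo_path` and `demo_select_path()` are hypothetical names used only for illustration, and the 4096 limit is an assumption because the description does not state the value of SMR_INJECT_SIZE. The real dispatch also depends on operation type and peer flags, which this sketch ignores.

```c
#include <stddef.h>

/* Values quoted in this PR description. */
#define OLD_SMR_CMD_SIZE      360
#define NEW_SMR_CMD_SIZE      488
#define OLD_SMR_MSG_DATA_LEN  192
#define NEW_SMR_MSG_DATA_LEN  320

/* ASSUMPTION: illustrative inject limit, not taken from the PR text. */
#define DEMO_INJECT_SIZE      4096

/* The non-payload part of a command stays at 168 bytes across the resize. */
_Static_assert(OLD_SMR_CMD_SIZE - OLD_SMR_MSG_DATA_LEN == 168,
	       "old command overhead");
_Static_assert(NEW_SMR_CMD_SIZE - NEW_SMR_MSG_DATA_LEN == 168,
	       "new command overhead");

enum demo_path { DEMO_INLINE, DEMO_INJECT, DEMO_OTHER };

/* Pick a transfer path purely from payload length and the inline cap. */
static enum demo_path demo_select_path(size_t len, size_t inline_cap)
{
	if (len <= inline_cap)
		return DEMO_INLINE;
	if (len <= DEMO_INJECT_SIZE)
		return DEMO_INJECT;
	return DEMO_OTHER;
}
```

With the old cap, `demo_select_path(256, OLD_SMR_MSG_DATA_LEN)` lands on the inject path; with the new cap the same 256B message stays inline, and 321B is the first size that goes to inject.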

Performance (medians across 5 latency / 3 BW rounds, 20K iterations, window=2000 for BW):

Latency (usec), PATCHED vs v2.4.x / PATCHED vs main:

| Size | AMD AL2 | AMD Ubuntu | Graviton c7gn | Intel P5EN | Intel P4D |
|------|---------|------------|---------------|------------|-----------|
| 256B | -11.7%/-22.7% | -7.0%/-19.4% | -15.9%/-25.8% | -24.5%/-40.0% | -29.9%/-38.2% |
| 512B | -7.3%/-17.1% | -7.3%/-17.9% | +3.3%/-4.2% | +4.9%/-15.7% | -9.9%/-14.3% |
| 1K | -11.7%/-19.7% | -8.5%/-13.4% | +2.1%/-6.2% | +8.8%/-13.7% | -14.3%/-16.2% |
| 2K | +3.9%/-8.2% | -5.7%/-12.5% | -9.8%/-14.2% | +12.6%/-10.9% | -10.8%/-13.9% |
| 4K | +1.7%/-9.1% | +0.8%/-9.7% | -8.1%/-11.8% | +6.6%/-13.8% | -1.8%/-6.0% |

Bandwidth (MB/sec ratio), PATCHED vs v2.4.x / PATCHED vs main:

| Size | AMD AL2 | AMD Ubuntu | Graviton c7gn | Intel P5EN | Intel P4D |
|------|---------|------------|---------------|------------|-----------|
| 256B | +151%/+183% | +149%/+194% | +127%/+102% | +383%/+183% | +213%/+127% |
| 1K | +49%/+19% | +52%/+14% | +109%/+48% | +178%/+33% | +122%/+20% |
| 2K | +19%/+11% | +24%/+3% | +84%/+36% | +132%/+34% | +74%/+6% |
| 4K | +36%/+4% | +20%/+2% | +59%/+30% | +108%/+29% | +46%/+2% |

Correctness: the RDM fabtests pass on AMD (AL2) and Intel (P5EN). No hangs under BW stress.

Remaining regressions (e.g. AMD 512B BW in multi-size sweeps, Intel P5EN 2K-4K latency) are architectural to main's sender-owned inject buffer model and would require reverting to v2.4.x-style peer-owned inject pools.

…ssions

The SHM provider regressed vs v2.4.x for small-to-medium tagged/msg
transfers after the command queue refactor. Profiling identified four
root causes:

1. Generic dispatch overhead: main routes ALL sends through
   smr_generic_sendmsg -> smr_send_ops[proto]() function pointer dispatch,
   with heap allocation for smr_pend_entry, IOV copies, and pending-entry
   completion tracking on every send.

2. Inline data cap too small: SMR_MSG_DATA_LEN=192 forces 256B messages
   into the inject path, losing inline's single-cacheline fast-path.

3. Store-to-load forwarding stall: the sequence
     mov    %r11, 0x8(%rcx)       ; 8-byte store
     movdqu 0x8(%rcx), %xmm1      ; 16-byte load of same location
   cannot be forwarded (size mismatch) and consumed 82% of smr_tinject
   time on Intel, confirmed via `perf annotate`.

4. Cross-process cacheline reads on receive: main reads header fields
   from pcmd (sender's region), ~50ns per field on Intel mesh.

Fixes:

- smr_fast_tinject(): bypass generic dispatch for tagged inject when the
  payload fits inline, writing directly to ce->cmd without pend allocation.

- smr_fast_inject_v24(): dedicated path for payloads from 321B up to
  SMR_INJECT_SIZE. Builds the header on the stack first, then stores to
  pcmd (local) and ce->cmd (shared). This avoids the load-after-store
  pattern. Uses a dedicated SMR_FAST_INJECT_TX_CTX marker so the return
  queue handler skips pend dereference. Gated by a new
  SMR_FLAG_FAST_INJECT_V2 peer flag for backward compatibility during
  rolling upgrades.

- smr_fast_tsend(): analogous fast path for fi_send/fi_tsend with
  completion, covering both inline (<=SMR_MSG_DATA_LEN) and inject
  sub-paths.

- Inject fast receive path in smr_progress_cmd: dispatches from
  ce->cmd.hdr (local L1) instead of pcmd (cross-process). A _hdr
  abstraction reads remaining fields from the appropriate location
  (ce->cmd for fast inject, pcmd for slow path). A small prefetch
  (<=2 cachelines) primes the HW prefetcher for the data copy.

- SMR_CMD_SIZE grown from 360 to 488, increasing SMR_MSG_DATA_LEN from
  192 to 320. 256B messages now use the inline path. This is the single
  largest win (+127% to +383% BW at 256B vs v2.4.x across platforms).
  SMR_VERSION bumped to 11 to reflect the new queue entry layout.

- tx_ctx cleared in all slow-path branches where cmd = &ce->cmd (inline
  in generic_sendmsg/generic_inject/RMA/atomic) to prevent stale
  SMR_FAST_INJECT_TX_CTX markers from prior fast-inject usage of the
  same queue slot misdirecting the receiver's dispatch.

- Unexpected-message handling (ENOENT + freestack empty): the original
  code silently released the queue entry and dropped the message. Fixed
  to break out of the progress loop so the next poll retries with a
  fresh freestack, avoiding message loss under heavy windowed BW.

Performance (medians across 5 lat / 3 BW rounds, 20K iterations,
window=2000 for BW):

  Latency (usec), PATCHED vs v2.4.x / PATCHED vs main:
                AMD AL2        AMD Ubuntu     Graviton c7gn  Intel P5EN     Intel P4D
    256B    -11.7%/-22.7%  -7.0%/-19.4%  -15.9%/-25.8%  -24.5%/-40.0%  -29.9%/-38.2%
    512B     -7.3%/-17.1%  -7.3%/-17.9%   +3.3%/ -4.2%   +4.9%/-15.7%   -9.9%/-14.3%
    1K      -11.7%/-19.7%  -8.5%/-13.4%   +2.1%/ -6.2%   +8.8%/-13.7%  -14.3%/-16.2%
    2K       +3.9%/ -8.2%  -5.7%/-12.5%   -9.8%/-14.2%  +12.6%/-10.9%  -10.8%/-13.9%
    4K       +1.7%/ -9.1%  +0.8%/ -9.7%   -8.1%/-11.8%   +6.6%/-13.8%   -1.8%/ -6.0%

  Bandwidth (MB/sec ratio), PATCHED vs v2.4.x / PATCHED vs main:
                AMD AL2        AMD Ubuntu     Graviton c7gn  Intel P5EN     Intel P4D
    256B   +151%/+183%    +149%/+194%    +127%/+102%    +383%/+183%    +213%/+127%
    1K      +49%/ +19%     +52%/ +14%    +109%/ +48%    +178%/ +33%    +122%/ +20%
    2K      +19%/ +11%     +24%/  +3%     +84%/ +36%    +132%/ +34%     +74%/  +6%
    4K      +36%/  +4%     +20%/  +2%     +59%/ +30%    +108%/ +29%     +46%/  +2%

Correctness: 35/35 RDM fabtests pass on AMD (AL2) and Intel (P5EN). No
hangs under BW stress (20K iter x window=2000 x all sizes).

Remaining regressions (e.g. AMD 512B BW in multi-size sweeps, Intel
P5EN 2K-4K latency) are architectural to main's sender-owned inject
buffer model and would require reverting to v2.4.x-style peer-owned
inject pools with a lock-free SPSC freestack to fix fully. PATCHED is
better than main at nearly all sizes on all platforms, so the
regressions are relative to v2.4.x's pre-refactor baseline, not to the
current shipping branch.

Signed-off-by: Yin Li <yinliq@amazon.com>