Skip to content

Peer API RMA changes + SHM clean up#2

Merged
shijin-aws merged 3 commits intoshijin-aws:peer_develfrom
sunkuamzn:peer-api-rma
Mar 20, 2023
Merged

Peer API RMA changes + SHM clean up#2
shijin-aws merged 3 commits intoshijin-aws:peer_develfrom
sunkuamzn:peer-api-rma

Conversation

@sunkuamzn
Copy link
Copy Markdown
Collaborator

No description provided.


if (peer->is_local && rxr_ep->use_shm_for_tx) {
use_lower_ep_read = true;
// msg_clone->addr = peer->shm_fiaddr;
Copy link
Copy Markdown
Collaborator Author

@sunkuamzn sunkuamzn Mar 17, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The fi_rma_bw test passes when I use msg and msg_clone. I don't know how that's possible..

I have to use msg_clone to set the address because the input parameter msg is const. Using msg_clone generates a compiler warning.

Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fabtests only has 1 peer that that the shm_fiaddr could be the same with efa addr. We should use shm_fiaddr for any fi_* call for shm

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see OK

What's the best way to do this? The msg_clone idea which I have here generates a compiler warning. I could do a memcpy but that would be more expensive?

Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are rxr_msg_construct and rxr_tmsg_construct in rxr_msg.h. You can do similar things to have a function to construct a new fi_rma_msg

Signed-off-by: Sai Sunku <sunkusa@amazon.com>
Signed-off-by: Sai Sunku <sunkusa@amazon.com>
Signed-off-by: Sai Sunku <sunkusa@amazon.com>
@shijin-aws shijin-aws merged commit f8ea3f9 into shijin-aws:peer_devel Mar 20, 2023
shijin-aws pushed a commit that referenced this pull request Sep 12, 2023
If a posted receive matches with a saved receive, we may need to
increment the rx counter.  Set the rx counter increment callback
to match that of the posted receive.  This fixes an assert in
xnet_cntr_inc() accessing a NULL cntr_inc function pointer.

Program received signal SIGABRT, Aborted.
0x0000155552d4d37f in raise () from /lib64/libc.so.6
#0  0x0000155552d4d37f in raise () from /lib64/libc.so.6
#1  0x0000155552d37db5 in abort () from /lib64/libc.so.6
#2  0x0000155552d37c89 in __assert_fail_base.cold.0 () from /lib64/libc.so.6
#3  0x0000155552d45a76 in __assert_fail () from /lib64/libc.so.6
#4  0x00001555522967f9 in xnet_cntr_inc (ep=0x6e4c70, xfer_entry=0x6f7a30) at prov/tcp/src/xnet_cq.c:347
#5  0x0000155552296836 in xnet_report_cntr_success (ep=0x6e4c70, cq=0x6ca930, xfer_entry=0x6f7a30) at prov/tcp/src/xnet_cq.c:354
#6  0x000015555229970d in xnet_complete_saved (saved_entry=0x6f7a30) at prov/tcp/src/xnet_progress.c:153
#7  0x0000155552299961 in xnet_recv_saved (saved_entry=0x6f7a30, rx_entry=0x6f7840) at prov/tcp/src/xnet_progress.c:188
#8  0x00001555522946f8 in xnet_srx_tag (srx=0x6dd1c0, recv_entry=0x6f7840) at prov/tcp/src/xnet_srx.c:445
#9  0x0000155552294bb1 in xnet_srx_trecv (ep_fid=0x6dd1c0, buf=0x6990c4, len=4, desc=0x0, src_addr=0, tag=21474836494, ignore=3458764513820540928, context=0x7ffffffeb180) at prov/tcp/src/xnet_srx.c:558
#10 0x000015555228f60e in fi_trecv (ep=0x6dd1c0, buf=0x6990c4, len=4, desc=0x0, src_addr=0, tag=21474836494, ignore=3458764513820540928, context=0x7ffffffeb180) at ./include/rdma/fi_tagged.h:91
#11 0x00001555522900a7 in xnet_rdm_trecv (ep_fid=0x6d9fe0, buf=0x6990c4, len=4, desc=0x0, src_addr=0, tag=21474836494, ignore=3458764513820540928, context=0x7ffffffeb180) at prov/tcp/src/xnet_rdm.c:212

Signed-off-by: Sean Hefty <sean.hefty@intel.com>
shijin-aws added a commit that referenced this pull request Nov 24, 2025
We need such barrier in two transitions:
1. from wqe write to doorbell ring
2. from doorbell ring to wqe write.

For #2, we currently have the barrier right after the doorbell ring,
which is different from rdma-core implementation that has the doorbell
right in the next wr_start, and according to benchmark such choice
hurts performance on ARM platform. This patch moves the #2 barrier
to the beginning of wr start, before the wqe write, to be consistent
with rdma-core and improves the performance.

Signed-off-by: Shi Jin <sjina@amazon.com>
shijin-aws added a commit that referenced this pull request Nov 25, 2025
We need such barrier in two transitions:
1. from wqe write to doorbell ring
2. from doorbell ring to wqe write.

For #2, we currently have the barrier right after the doorbell ring,
which is different from rdma-core implementation that has the doorbell
right in the next wr_start, and according to benchmark such choice
hurts performance on ARM platform. This patch moves the #2 barrier
to the beginning of wr start, before the wqe write, to be consistent
with rdma-core and improves the performance.

Signed-off-by: Shi Jin <sjina@amazon.com>
shijin-aws added a commit that referenced this pull request Dec 1, 2025
We need such barrier in two transitions:
1. from wqe write to doorbell ring
2. from doorbell ring to wqe write.

For #2, we currently have the barrier right after the doorbell ring,
which is different from rdma-core implementation that has the doorbell
right in the next wr_start, and according to benchmark such choice
hurts performance on ARM platform. This patch moves the #2 barrier
to the beginning of wr start, before the wqe write, to be consistent
with rdma-core and improves the performance.

Signed-off-by: Shi Jin <sjina@amazon.com>
(cherry picked from commit c4937ed)
shijin-aws added a commit that referenced this pull request Dec 29, 2025
We need such barrier in two transitions:
1. from wqe write to doorbell ring
2. from doorbell ring to wqe write.

For #2, we currently have the barrier right after the doorbell ring,
which is different from rdma-core implementation that has the doorbell
right in the next wr_start, and according to benchmark such choice
hurts performance on ARM platform. This patch moves the #2 barrier
to the beginning of wr start, before the wqe write, to be consistent
with rdma-core and improves the performance.

Signed-off-by: Shi Jin <sjina@amazon.com>
(cherry picked from commit c4937ed)
shijin-aws pushed a commit that referenced this pull request Feb 13, 2026
…I_EFA_ENABLE_SHM_TRANSFER=0

This patch fixes two related bugs involving uninitialized descriptor arrays that
cause segmentation faults when FI_EFA_ENABLE_SHM_TRANSFER=0 is set, but only in
release builds with -DNDEBUG.

Bug #1: Inconsistent memset in efa_rdm_txe_construct()
When msg->desc is NULL, the code zeroed the entire desc array with:
    memset(txe->desc, 0, sizeof(txe->desc));
This zeros all EFA_RDM_IOV_LIMIT elements, but the memcpy branch only copies
iov_count elements. The fix uses sizeof(*txe->desc) * msg->iov_count to match
the memcpy() branch and the parallel iov[] and mr[] arrays.

Bug #2: Dangling pointer in efa_rdm_atomic_ex
The compare_desc field was declared as void **compare_desc (pointer), which was
set to point to the caller's stack memory. When the caller returned, this became
a dangling pointer. Later access to compare_desc[i] would read garbage from
reused stack memory.
The fix changes compare_desc to an array void *compare_desc[EFA_RDM_IOV_LIMIT]
with its own persistent storage, and copies the data instead of storing a pointer.

Why debug builds didn't crash:
Debug builds zero-initialize memory, so uninitialized desc[i] values were NULL.
The if (desc) check in efa_copy_from_hmem() would fail, avoiding the dereference.
Release builds leave memory uninitialized, so desc[i] contained garbage pointers
like 0x73e54aab7848. The if (desc) check would pass, then dereference the garbage
pointer, causing a segmentation fault.

Tested on c7i.48xlarge clusters with 96-192 ranks using IMB-EXT Accumulate
benchmark in both debug (-g -O0) and release (-DNDEBUG -O2) builds.

Signed-off-by: Yin Li <yinliq@amazon.com>
shijin-aws pushed a commit that referenced this pull request Feb 23, 2026
…lt with FI_EFA_ENABLE_SHM_TRANSFER=0

This patch fixes two related bugs involving uninitialized descriptor arrays that
cause segmentation faults when FI_EFA_ENABLE_SHM_TRANSFER=0 is set, but only in
release builds with -DNDEBUG.

Bug #1: Inconsistent memset in efa_rdm_txe_construct()
When msg->desc is NULL, the code zeroed the entire desc array with:
    memset(txe->desc, 0, sizeof(txe->desc));
This zeros all EFA_RDM_IOV_LIMIT elements, but the memcpy branch only copies
iov_count elements. The fix uses sizeof(*txe->desc) * msg->iov_count to match
the memcpy() branch and the parallel iov[] and mr[] arrays.

Bug #2: Dangling pointer in efa_rdm_atomic_ex
The compare_desc field was declared as void **compare_desc (pointer), which was
set to point to the caller's stack memory. When the caller returned, this became
a dangling pointer. Later access to compare_desc[i] would read garbage from
reused stack memory.
The fix changes compare_desc to an array void *compare_desc[EFA_RDM_IOV_LIMIT]
with its own persistent storage, and copies the data instead of storing a pointer.

Why debug builds didn't crash:
Debug builds zero-initialize memory, so uninitialized desc[i] values were NULL.
The if (desc) check in efa_copy_from_hmem() would fail, avoiding the dereference.
Release builds leave memory uninitialized, so desc[i] contained garbage pointers
like 0x73e54aab7848. The if (desc) check would pass, then dereference the garbage
pointer, causing a segmentation fault.

Tested on c7i.48xlarge clusters with 96-192 ranks using IMB-EXT Accumulate
benchmark in both debug (-g -O0) and release (-DNDEBUG -O2) builds.

Signed-off-by: Yin Li <yinliq@amazon.com>
(cherry picked from commit 6797933)
Signed-off-by: Shi Jin <sjina@amazon.com>
shijin-aws pushed a commit that referenced this pull request Feb 26, 2026
…lt with FI_EFA_ENABLE_SHM_TRANSFER=0

This patch fixes two related bugs involving uninitialized descriptor arrays that
cause segmentation faults when FI_EFA_ENABLE_SHM_TRANSFER=0 is set, but only in
release builds with -DNDEBUG.

Bug #1: Inconsistent memset in efa_rdm_txe_construct()
When msg->desc is NULL, the code zeroed the entire desc array with:
    memset(txe->desc, 0, sizeof(txe->desc));
This zeros all EFA_RDM_IOV_LIMIT elements, but the memcpy branch only copies
iov_count elements. The fix uses sizeof(*txe->desc) * msg->iov_count to match
the memcpy() branch and the parallel iov[] and mr[] arrays.

Bug #2: Dangling pointer in efa_rdm_atomic_ex
The compare_desc field was declared as void **compare_desc (pointer), which was
set to point to the caller's stack memory. When the caller returned, this became
a dangling pointer. Later access to compare_desc[i] would read garbage from
reused stack memory.
The fix changes compare_desc to an array void *compare_desc[EFA_RDM_IOV_LIMIT]
with its own persistent storage, and copies the data instead of storing a pointer.

Why debug builds didn't crash:
Debug builds zero-initialize memory, so uninitialized desc[i] values were NULL.
The if (desc) check in efa_copy_from_hmem() would fail, avoiding the dereference.
Release builds leave memory uninitialized, so desc[i] contained garbage pointers
like 0x73e54aab7848. The if (desc) check would pass, then dereference the garbage
pointer, causing a segmentation fault.

Tested on c7i.48xlarge clusters with 96-192 ranks using IMB-EXT Accumulate
benchmark in both debug (-g -O0) and release (-DNDEBUG -O2) builds.

Signed-off-by: Yin Li <yinliq@amazon.com>
(cherry picked from commit 6797933)
Signed-off-by: Shi Jin <sjina@amazon.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants