librdmacm: extend rsocket for Redis, iperf3, memcached and more Linux APIs
Rsocket upstream #1702
BatshevaBlack wants to merge 14 commits into linux-rdma:master from
This commit introduces epoll_create functionality to support a centralized thread for managing all epoll instances.

The epoll_create call creates an epoll_inst struct and two epoll file descriptors: a "regular epfd" for handling real file descriptors, and a second epfd to which the "regular epfd" is added using epoll_ctl. The second epfd is the one returned from the epoll_create function. Additionally, the new epoll instance is registered with a global thread that processes all instances in a round-robin fashion, handling events for both regular and rsocket file descriptors.

The global thread manages polling in two steps for each epoll instance. First, it iterates through the list of rsocket fds in the epoll struct, polling each one to check for events. Second, it calls epoll_wait on the "regular epfd" to gather events from the real file descriptors. The thread keeps the events in the struct and proceeds to the next epoll instance.

Signed-off-by: Batsheva Black <bblack@nvidia.com>
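The two-descriptor arrangement described in this commit message relies on an epoll fd itself being pollable. A minimal sketch of that mechanism follows, using only standard Linux epoll calls. The `nested_epoll` struct and both function names are hypothetical and are not taken from the patch; the rsocket list and the global thread are left out.

```c
#include <assert.h>
#include <sys/epoll.h>
#include <unistd.h>

/* Hypothetical pair of descriptors mirroring the commit description. */
struct nested_epoll {
	int inner;	/* "regular epfd": watches real file descriptors */
	int outer;	/* epfd handed back to the caller; watches 'inner' */
};

/* Create both epoll fds and register the inner one with the outer one.
 * Returns 0 on success, -1 on failure. */
static int nested_epoll_create(struct nested_epoll *ne)
{
	struct epoll_event ev = { .events = EPOLLIN };

	assert(ne != NULL);
	ne->inner = epoll_create1(0);
	if (ne->inner < 0)
		return -1;

	ne->outer = epoll_create1(0);
	if (ne->outer < 0) {
		close(ne->inner);
		return -1;
	}

	ev.data.fd = ne->inner;
	if (epoll_ctl(ne->outer, EPOLL_CTL_ADD, ne->inner, &ev) < 0) {
		close(ne->outer);
		close(ne->inner);
		return -1;
	}
	return 0;
}

/* Register a readable pipe on the inner epfd and check that the outer
 * epfd reports the inner epfd as ready. Returns 1 if it does, else 0. */
static int nested_epoll_demo(void)
{
	struct nested_epoll ne;
	struct epoll_event ev = { .events = EPOLLIN }, out;
	int p[2], ok = 0;

	if (pipe(p) < 0)
		return 0;
	if (nested_epoll_create(&ne) < 0)
		goto close_pipe;

	ev.data.fd = p[0];
	if (epoll_ctl(ne.inner, EPOLL_CTL_ADD, p[0], &ev) < 0)
		goto close_epoll;
	if (write(p[1], "x", 1) != 1)
		goto close_epoll;

	/* The inner epfd has a pending event, so the outer one sees it as readable. */
	if (epoll_wait(ne.outer, &out, 1, 1000) == 1 && out.data.fd == ne.inner)
		ok = 1;

close_epoll:
	close(ne.outer);
	close(ne.inner);
close_pipe:
	close(p[0]);
	close(p[1]);
	return ok;
}
```

In this sketch the caller would only ever see `ne.outer`, which matches the description of returning the second epfd from epoll_create.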
This commit implements epoll_ctl with tailored handling for real and rsocket file descriptors. For regular file descriptors, epoll_ctl operates directly on the "regular epfd". rsocket file descriptors are instead added to a dedicated list maintained in the epoll instance struct, so that the global thread can handle them during its polling cycle. epoll_ctl then triggers the thread to reprocess the epoll instance and update the ready list, reflecting any events on the newly added file descriptors. Signed-off-by: Batsheva Black <bblack@nvidia.com>
This commit implements epoll_wait to retrieve events processed by the centralized thread for an epoll instance. When epoll_wait is called, it copies the events collected by the global thread from the ready list in the epoll instance to the user-provided events buffer. If no events are available in the `revents` field, the function triggers the thread to recheck for events. epoll_wait returns the total number of ready events. Signed-off-by: Batsheva Black <bblack@nvidia.com>
When poll returns because of a timeout, clear any signals that arrived in the meantime by calling rs_poll_exit. Signed-off-by: Batsheva Black <bblack@nvidia.com>
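The idea of clearing pending wake-ups on the timeout path can be illustrated with a plain eventfd. This is only a stand-in: the internals of rs_poll_exit are not shown in this conversation, and `clear_signals` and `wait_then_clear` are hypothetical names. The sketch assumes the signal fd is a non-blocking eventfd.

```c
#include <assert.h>
#include <poll.h>
#include <stdint.h>
#include <sys/eventfd.h>
#include <unistd.h>

/* Consume every pending wake-up on a non-blocking eventfd.
 * Returns the number of signals cleared, or 0 if none were pending. */
static uint64_t clear_signals(int sigfd)
{
	uint64_t count = 0;

	if (read(sigfd, &count, sizeof(count)) != (ssize_t)sizeof(count))
		return 0;
	return count;
}

/* Wait on the signal fd, then clear it on every return path, including a
 * timeout, so a late signal cannot cause a spurious wake-up next time. */
static int wait_then_clear(int sigfd, int timeout_ms, uint64_t *cleared)
{
	struct pollfd pfd = { .fd = sigfd, .events = POLLIN };
	int ret;

	assert(cleared != NULL);
	ret = poll(&pfd, 1, timeout_ms);
	*cleared = clear_signals(sigfd);
	return ret;
}
```

A caller would create the descriptor with `eventfd(0, EFD_NONBLOCK)`; writers signal by writing a 64-bit `1` to it.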
Keep the list of fds that are passed to poll so that, when the revents are copied back to the caller's fds list, each rfd can be matched to the fd it belongs to. Signed-off-by: Batsheva Black <bblack@nvidia.com>
The accept4 implementation extends accept to support the additional atomic flag-setting functionality provided by accept4. Signed-off-by: Batsheva Black <bblack@nvidia.com>
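For readers unfamiliar with what accept4 adds over accept, the sketch below emulates its flag handling with accept() followed by fcntl(). It is an illustration on ordinary sockets, not the code from this commit, and unlike the real accept4() it is not atomic because the flags are applied after the descriptor already exists. `accept_with_flags` and `accept_demo` are hypothetical names; the demo uses the Linux-specific autobind feature of AF_UNIX sockets to avoid needing a network.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/socket.h>
#include <sys/un.h>
#include <unistd.h>

/* Emulate accept4() flag handling on top of accept() + fcntl().
 * NOT atomic: another thread could fork/exec between the two steps. */
static int accept_with_flags(int listen_fd, struct sockaddr *addr,
			     socklen_t *addrlen, int flags)
{
	int fd, fl;

	assert(addr == NULL || addrlen != NULL);
	if (flags & ~(SOCK_NONBLOCK | SOCK_CLOEXEC)) {
		errno = EINVAL;
		return -1;
	}

	fd = accept(listen_fd, addr, addrlen);
	if (fd < 0)
		return -1;

	if (flags & SOCK_CLOEXEC) {	/* descriptor flag */
		fl = fcntl(fd, F_GETFD);
		if (fl < 0 || fcntl(fd, F_SETFD, fl | FD_CLOEXEC) < 0)
			goto err;
	}
	if (flags & SOCK_NONBLOCK) {	/* file status flag */
		fl = fcntl(fd, F_GETFL);
		if (fl < 0 || fcntl(fd, F_SETFL, fl | O_NONBLOCK) < 0)
			goto err;
	}
	return fd;
err:
	close(fd);
	return -1;
}

/* Connect a local client and check both flags on the accepted fd.
 * Returns 1 if both flags are set, else 0. */
static int accept_demo(void)
{
	struct sockaddr_un sa = { .sun_family = AF_UNIX };
	socklen_t len = sizeof(sa);
	int lfd, cfd, afd, ok = 0;

	lfd = socket(AF_UNIX, SOCK_STREAM, 0);
	cfd = socket(AF_UNIX, SOCK_STREAM, 0);
	if (lfd < 0 || cfd < 0)
		goto out;
	/* Linux autobind: a family-only address gets a unique abstract name. */
	if (bind(lfd, (struct sockaddr *)&sa, sizeof(sa_family_t)) < 0 ||
	    listen(lfd, 1) < 0)
		goto out;
	if (getsockname(lfd, (struct sockaddr *)&sa, &len) < 0)
		goto out;
	if (connect(cfd, (struct sockaddr *)&sa, len) < 0)
		goto out;

	afd = accept_with_flags(lfd, NULL, NULL, SOCK_NONBLOCK | SOCK_CLOEXEC);
	if (afd >= 0) {
		ok = (fcntl(afd, F_GETFL) & O_NONBLOCK) &&
		     (fcntl(afd, F_GETFD) & FD_CLOEXEC);
		close(afd);
	}
out:
	if (lfd >= 0)
		close(lfd);
	if (cfd >= 0)
		close(cfd);
	return ok;
}
```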
Add preload interception for fcntl64 so rsocket file descriptors support the same flag semantics as the glibc fcntl64 API. Signed-off-by: Batsheva Black <bblack@nvidia.com>
Add support for additional socket options. getsockopt: TCP_INFO, TCP_CONGESTION, SO_BROADCAST and IP_TOS. setsockopt: IP_TOS and TCP_CONGESTION. Signed-off-by: Batsheva Black <bblack@nvidia.com>
rfcntl keeps all of the file's flags in fd_flags. Adding the new fs_flags field to the rs struct allows the fcntl function to keep the file status flags separate from the file descriptor flags. Signed-off-by: Batsheva Black <bblack@nvidia.com>
Add preload interception for sendfile64 so applications using the 64-bit offset sendfile64 API work correctly with rsocket file descriptors. Signed-off-by: Batsheva Black <bblack@nvidia.com>
Add preload interception for dup so that duplicating an rsocket file descriptor produces another rsocket fd that refers to the same connection. Signed-off-by: Batsheva Black <bblack@nvidia.com>
To allow us to respond to disconnect events initiated by the peer kernel CM, always run the connect service with the TCP protocol, including when the connection succeeds. Signed-off-by: Batsheva Black <bblack@nvidia.com>
The changes to rpoll to use a signaling fd to wake up blocked threads, combined with suspending polling while rsocket states may be changing, _should_ prevent any threads from blocking indefinitely in rpoll() when a desired state change occurs. We periodically wake up any polling thread so that it can recheck its rsocket states. The sleep interval was set to an arbitrary value of 5 seconds. That is too long for apps that request a connection and depend on the thread waking up, so it is now 0.5 seconds and can be overridden using config files. Signed-off-by: Batsheva Black <bblack@nvidia.com>
Updated type checks to identify socket types even when additional flags are present in the type field. Changed the comparison to use bitwise AND for more accurate detection. Signed-off-by: Batsheva Black <bblack@nvidia.com>
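As a sketch of the kind of check this commit describes: Linux lets callers OR SOCK_NONBLOCK and SOCK_CLOEXEC into the socket() type argument, so an equality test against SOCK_STREAM fails once a flag is present. The helper names below are hypothetical, and the exact mask used in the patch is not shown in this conversation.

```c
#include <assert.h>
#include <sys/socket.h>

/* Strip the creation flags that Linux allows in the socket() type argument. */
static int base_sock_type(int type)
{
	return type & ~(SOCK_NONBLOCK | SOCK_CLOEXEC);
}

/* Compare only the base type, ignoring any flags. */
static int is_stream_type(int type)
{
	return base_sock_type(type) == SOCK_STREAM;
}

/* Usage example doubling as a self-check. */
static void type_mask_example(void)
{
	/* A plain equality test no longer matches once a flag is set. */
	assert((SOCK_STREAM | SOCK_NONBLOCK) != SOCK_STREAM);
	assert(is_stream_type(SOCK_STREAM | SOCK_NONBLOCK));
	assert(!is_stream_type(SOCK_DGRAM | SOCK_CLOEXEC));
}
```

Masking the flags out and then comparing is safer than testing `type & SOCK_STREAM` directly, because the base type values are small integers rather than independent bits.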
poll support is non-trivial. Is there a reason why epoll support was implemented over rpoll?
shefty left a comment
See comments. I didn't review the epoll code in detail, as I would have expected the implementation to leverage the rpoll() path.
        }

        return rfds_r;
    }
Please merge this with fds_alloc(). Have a single malloc allocate both arrays. Make fds_alloc() return an int, with rfds and 'user_fds' as output parameters.
The "_r" suffix doesn't convey what the second array is, which makes the rest of the code confusing to read. Please rename it to something like user_fds.
Finally, the commit message says what the change does, but not why it's necessary. Please add details on why we want this change.
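One possible shape for the merged allocator requested above is sketched here. It is a guess at the interface, not the existing fds_alloc(): the name `fds_alloc_sketch` is hypothetical, and any caching the real function may do is omitted. A single allocation holds the pollfd array followed by the int array, and the caller frees it through the `rfds` pointer.

```c
#include <assert.h>
#include <errno.h>
#include <poll.h>
#include <stdlib.h>

/* Allocate nfds pollfd entries and nfds ints in one block.
 * Returns 0 on success, -1 with errno set on failure.
 * The caller releases both arrays with free(*rfds). */
static int fds_alloc_sketch(nfds_t nfds, struct pollfd **rfds, int **user_fds)
{
	void *buf;

	assert(rfds != NULL && user_fds != NULL);
	/* calloc checks the size multiplication for overflow. */
	buf = calloc(nfds ? nfds : 1, sizeof(struct pollfd) + sizeof(int));
	if (!buf) {
		errno = ENOMEM;
		return -1;
	}

	*rfds = buf;
	/* The int array starts right after the last pollfd entry. */
	*user_fds = (int *)(*rfds + nfds);
	return 0;
}
```

Placing the pollfd array first keeps the int array naturally aligned, since struct pollfd has at least the alignment of int.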
     static void select_to_rpoll(struct pollfd *fds, int *nfds,
    -                            fd_set *readfds, fd_set *writefds, fd_set *exceptfds)
    +                            fd_set *readfds, fd_set *writefds, fd_set *exceptfds, int *fds_r)
Here and in other places, rename fds_r to user_fds.
         * and can respond to disconnect requests
         */
        rs_notify_svc(&connect_svc, rs, RS_SVC_ADD_CM);
        errno = save_errno;
This adds the rsocket to the CM service even in the case where the connect call failed (ret == -1 && errno != EINPROGRESS). Change the original if check to include ret == 0.
The commit message is misleading. This is adding the rsocket to the internal thread used to drive progress. It's not related to the TCP protocol.
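The condition being asked for can be written as a small predicate. This sketch assumes the original check covered only the in-progress case (ret == -1 with errno == EINPROGRESS), which is what the comment implies; `should_add_to_cm_svc` is a hypothetical name and the surrounding connect code is not reproduced.

```c
#include <assert.h>
#include <errno.h>

/* Decide whether the rsocket should be handed to the CM service:
 * yes when the connect succeeded or is still in progress, no when it failed. */
static int should_add_to_cm_svc(int ret, int err)
{
	return ret == 0 || (ret == -1 && err == EINPROGRESS);
}

/* Usage example doubling as a self-check. */
static void cm_svc_example(void)
{
	assert(should_add_to_cm_svc(0, 0));			/* connected */
	assert(should_add_to_cm_svc(-1, EINPROGRESS));		/* non-blocking, pending */
	assert(!should_add_to_cm_svc(-1, ECONNREFUSED));	/* failed */
}
```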
     struct rsocket {
    -	int type;
    +	int category;
This patch would be smaller if it removed the flags from the type parameter and kept rsocket::type as the name that refers to the socket 'type'. Keeping the name 'type' is also easier for someone to understand, since that's the commonly used term. Readers won't automatically know what 'category' means.
The flags that Linux abuses with the type parameter should be stored in the fd_flags, so that they get applied correctly.
Summary
Extend the rsocket implementation in librdmacm so that applications such as Redis, iperf3, and memcached can use rsocket transparently via LD_PRELOAD (librspreload), and so rsocket aligns with more standard Linux socket and I/O behavior.
Motivation
The rsocket library did not fully support several POSIX/Linux interfaces (epoll, select, accept4, sendfile, fcntl64, and various socket options). Applications that rely on these either failed or fell back to TCP. This change implements or fixes those interfaces in rsocket, so the preload library can intercept them and route traffic over RDMA.
Changes