Implement GSO offload on lightway-server #413
Draft
kp-samuel-tam wants to merge 7 commits into `main`
Conversation
Add the `gso` module to lightway-core with VirtioNetHdr definition, checksum helpers, and segment build/count functions for splitting GSO superpackets into individual segments with correct per-segment header fixups (IP ID, TCP seq, checksums). Also add tun-rs workspace dependency to lightway-core and lightway-server Cargo.toml.
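As a rough illustration of the shapes involved (not the module's actual API — `segment_count` and the field comments are assumptions), the virtio header and a segment-count helper might look like:

```rust
/// Layout of `virtio_net_hdr` as defined in linux/virtio_net.h; this prefixes
/// every packet read from a TUN device opened with IFF_VNET_HDR.
#[repr(C, packed)]
#[derive(Clone, Copy, Default)]
pub struct VirtioNetHdr {
    pub flags: u8,        // e.g. VIRTIO_NET_HDR_F_NEEDS_CSUM
    pub gso_type: u8,     // e.g. VIRTIO_NET_HDR_GSO_TCPV4, or GSO_NONE
    pub hdr_len: u16,     // length of the headers copied into each segment (IP + TCP on a TUN device)
    pub gso_size: u16,    // MSS: payload bytes per resulting segment
    pub csum_start: u16,  // where checksumming starts
    pub csum_offset: u16, // offset of the checksum field from csum_start
}

/// How many segments a superpacket splits into for a given payload length
/// and MSS (hypothetical helper name).
pub fn segment_count(payload_len: usize, gso_size: usize) -> usize {
    if gso_size == 0 {
        1
    } else {
        payload_len.div_ceil(gso_size)
    }
}
```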
Add the `send_gso` method to the OutsideIOSendCallback trait for sending concatenated wire packets via kernel GSO (UDP_SEGMENT). Include todo!() stub implementations in client TCP/UDP, server TCP, and test harnesses to satisfy the trait contract.
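A sketch of how the trait extension could look (the simplified signature and the default-stub approach are assumptions, not necessarily what the PR does):

```rust
use std::io;

pub trait OutsideIOSendCallback {
    /// Existing single-packet send path (simplified signature).
    fn send(&self, buf: &[u8]) -> io::Result<usize>;

    /// Send one buffer holding many wire packets laid end to end, asking the
    /// kernel to cut it into `segment_size`-byte datagrams (UDP_SEGMENT).
    /// Transports that cannot do this (TCP, test harnesses) stub it out.
    fn send_gso(&self, buf: &[u8], segment_size: u16) -> io::Result<usize> {
        let _ = (buf, segment_size);
        todo!("GSO send not supported on this transport")
    }
}
```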
Refactor udp_send into udp_frame + udp_send to enable reuse of wire framing logic. Add gso_buf/gso_size fields to WolfSSLIOAdapter so the wolfssl send() callback can buffer raw encrypted segments during GSO processing. Add udp_send_gso to wrap buffered segments with wire headers and send via send_gso.
Add inside_data_received_gso and send_to_outside_gso methods to Connection. These process a GSO superpacket as a single packet through plugins/encoder, then split into per-segment encrypted frames and collect into a wire buffer for batch send via UDP_SEGMENT.
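The split-and-collect step can be pictured roughly as below; the closures stand in for Lightway's encoder and wire framing, header replication and the per-segment fixups from the `gso` module are omitted, and all names are illustrative:

```rust
/// Split an already-processed superpacket into MSS-sized segments, encrypt
/// each one, frame it, and append it to a single wire buffer suitable for one
/// UDP_SEGMENT send.
fn collect_gso_wire_buffer(
    superpacket: &[u8],
    gso_size: usize,
    mut encrypt: impl FnMut(&[u8]) -> Vec<u8>,
    mut frame: impl FnMut(&mut Vec<u8>, &[u8]),
) -> (Vec<u8>, u16) {
    let mut wire = Vec::new();
    let mut wire_segment_size = 0u16;
    for segment in superpacket.chunks(gso_size) {
        let ciphertext = encrypt(segment);
        let start = wire.len();
        frame(&mut wire, &ciphertext);
        // UDP_SEGMENT requires all segments except possibly the last to be
        // the same size, so remember the first framed segment's length.
        if wire_segment_size == 0 {
            wire_segment_size = (wire.len() - start) as u16;
        }
    }
    (wire, wire_segment_size)
}
```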
Add offload config field to TunConfig to enable IFF_VNET_HDR on TUN devices. Add recv_gso for raw reads that include the virtio_net_hdr prefix, and prepend a zeroed virtio header on try_send when offload is enabled.
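One detail worth calling out: with `IFF_VNET_HDR` enabled, every write to the TUN fd must carry a `virtio_net_hdr`, and an all-zero header simply means "no offload, no checksum help". A minimal sketch (the helper name is made up):

```rust
/// Size of struct virtio_net_hdr (linux/virtio_net.h).
const VIRTIO_NET_HDR_LEN: usize = 10;

/// Prefix an ordinary (non-GSO) packet with a zeroed virtio header before
/// writing it to a TUN device opened with IFF_VNET_HDR. All-zero fields mean
/// VIRTIO_NET_HDR_GSO_NONE and no checksum offload requested.
fn prepend_zeroed_vnet_hdr(packet: &[u8]) -> Vec<u8> {
    let mut buf = Vec::with_capacity(VIRTIO_NET_HDR_LEN + packet.len());
    buf.extend_from_slice(&[0u8; VIRTIO_NET_HDR_LEN]);
    buf.extend_from_slice(packet);
    buf
}
```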
Extend send_to_socket to accept an optional gso_size parameter and build UDP_SEGMENT cmsg for kernel-level segmentation. Implement the real send_gso on UdpSocket using this path.
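For readers unfamiliar with the kernel feature: UDP_SEGMENT can be requested per call via a cmsg on `sendmsg()` (the approach described above) or set once per socket with `setsockopt`. The sketch below uses the simpler socket-option form to keep the example short; it is not the PR's code path.

```rust
use std::io;
use std::net::UdpSocket;
use std::os::fd::AsRawFd;

/// UDP_SEGMENT from linux/udp.h (Linux 4.18+).
const UDP_SEGMENT: libc::c_int = 103;

/// Ask the kernel to split every subsequent send on this socket into
/// `gso_size`-byte datagrams. The PR instead passes the value per call as a
/// cmsg, which allows a different segment size on each send.
fn set_udp_segment(sock: &UdpSocket, gso_size: u16) -> io::Result<()> {
    let val: libc::c_int = gso_size.into();
    let rc = unsafe {
        libc::setsockopt(
            sock.as_raw_fd(),
            libc::SOL_UDP,
            UDP_SEGMENT,
            &val as *const libc::c_int as *const libc::c_void,
            std::mem::size_of_val(&val) as libc::socklen_t,
        )
    };
    if rc == 0 { Ok(()) } else { Err(io::Error::last_os_error()) }
}
```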
Add enable_tun_offload config option and wire it through ServerConfig to main. Extract the default inside IO loop into its own function and add inside_io_loop_gso that reads virtio-framed superpackets from TUN, dispatches GSO vs single-packet paths, and sets gso_max_size on the TUN device.
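A skeleton of the dispatch decision, synchronous and with the real handling elided (offsets follow `struct virtio_net_hdr`; the function name is hypothetical):

```rust
/// Size of struct virtio_net_hdr prefixed to each TUN read under IFF_VNET_HDR.
const VIRTIO_NET_HDR_LEN: usize = 10;

/// Decide whether a TUN read is a GSO superpacket or an ordinary packet.
/// Returns the inner packet and, for superpackets, the per-segment MSS.
fn classify_tun_read(frame: &[u8]) -> Option<(&[u8], Option<u16>)> {
    if frame.len() < VIRTIO_NET_HDR_LEN {
        return None; // short read: no virtio header present
    }
    let (hdr, packet) = frame.split_at(VIRTIO_NET_HDR_LEN);
    // gso_size sits at bytes 4..6 of virtio_net_hdr; the TUN device writes it
    // in native byte order by default.
    let gso_size = u16::from_ne_bytes([hdr[4], hdr[5]]);
    if gso_size > 0 {
        Some((packet, Some(gso_size))) // route to the GSO path
    } else {
        Some((packet, None)) // existing single-packet path
    }
}
```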
Code coverage summary for 3416250: ✅ Region coverage 67% — passes
Description
Implement GSO on the server side for DTLS and Expresslane, specifically for bulk server->client traffic. This consistently halves the total syscalls used during bulk transfers and improves aggregate server throughput by 2x when multiple clients are transferring.
When `--enable-tun-offload` is set, the server reads TSO superpackets from the TUN with `IFF_VNET_HDR`, segments them in userspace, and emits each superpacket as a single `sendmsg(UDP_SEGMENT)` instead of N per-segment syscalls. On a single-flow iperf3 reverse test the kernel UDP send path collapses near-completely: `udp_sendmsg` 0.71% → ~0.05%, `sock_alloc_send_pskb` 2.61% → ~0.13%, `mlx5e_xmit` 1.88% → ~0%.

Trade-off: kernel work is replaced with userspace work (per-segment IP/TCP/UDP checksum recomputation, segment assembly). The kernel-side wins are clear and measurable; the userspace cost is now the dominant factor.
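For context on what "per-segment checksum recomputation" involves, the underlying primitive is the RFC 1071 one's-complement sum, sketched below; Lightway's actual helpers may differ (for example by updating checksums incrementally rather than recomputing them from scratch):

```rust
/// RFC 1071 one's-complement checksum over a byte slice, as used for IPv4,
/// TCP, and UDP checksum fields. Sketch only.
fn internet_checksum(data: &[u8]) -> u16 {
    let mut sum: u32 = 0;
    let mut chunks = data.chunks_exact(2);
    for chunk in &mut chunks {
        sum += u32::from(u16::from_be_bytes([chunk[0], chunk[1]]));
    }
    // An odd trailing byte is padded with a zero byte on the right.
    if let [last] = chunks.remainder() {
        sum += u32::from(*last) << 8;
    }
    // Fold carries back into the low 16 bits.
    while sum > 0xffff {
        sum = (sum & 0xffff) + (sum >> 16);
    }
    !(sum as u16)
}
```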
Pacing: each `sendmsg(UDP_SEGMENT)` produces a NIC burst of up to N segments. This can exceed the receiver's socket buffer depth and increase tail drops at peak rates. We will need to revisit better TX pacing under congested links.

Future work will focus on compatibility with TUN backends like io_uring, GRO on the server side, and full GSO/GRO on the client side, where single-flow workloads should see the biggest visible speedup; that work is not in this PR.
Motivation and Context
See ticket CVPN-2346.
How Has This Been Tested?
Types of changes
Checklist: