Conversation
These files were on the branch from prior unrelated work (commit 40ff419, plus sonic_ep.py accidentally staged with the config scaffold). Saved to ../scratchpad/nixl_mx_unrelated/ for future reapplication.
Signed-off-by: Matej Sirovatka <54212263+S1ro1@users.noreply.github.com>
```python
@torch.no_grad()
def update_weights_from_path(self, weight_dir: str | None = None) -> None:
    """Block until the trainer's RDMA push completes, then recompute the MLA absorbed weights and return; the orchestrator can then call `/resume`."""
    self.rendezvous.wait_for_all_peers_ready(timeout=1200)
```
Hardcoded timeout ignores configurable timeout setting
Medium Severity
update_weights_from_path calls self.rendezvous.wait_for_all_peers_ready(timeout=1200) with a hardcoded 1200-second timeout. The trainer and orchestrator both use a configurable timeout from their respective configs. If a user configures a different timeout (e.g., for very large models needing longer), the inference worker won't respect it and could time out prematurely or wait too long.
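The fix could look something like the sketch below: read the timeout from the worker's config and keep 1200s only as a fallback default. The `weight_sync_timeout` field and the surrounding class shape are hypothetical names for illustration, not the repo's actual API.

```python
class InferenceWorker:
    """Minimal sketch; rendezvous and config shapes are illustrative."""

    def __init__(self, rendezvous, config):
        self.rendezvous = rendezvous
        self.config = config

    def update_weights_from_path(self, weight_dir=None):
        # Respect the user-configured timeout; fall back to 1200s only
        # when the config does not specify one (hypothetical field name).
        timeout = getattr(self.config, "weight_sync_timeout", 1200)
        self.rendezvous.wait_for_all_peers_ready(timeout=timeout)
```

This keeps the trainer, orchestrator, and inference worker consistent when a user raises the timeout for very large models.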
Reviewed by Cursor Bugbot for commit 9d9b2cc.
Cursor Bugbot has reviewed your changes and found 2 potential issues.
There are 3 total unresolved issues (including 1 from previous review).
Reviewed by Cursor Bugbot for commit 75efd8c.
```python
)
self.checkpoint_ready.clear()
wait_for_ckpt_start_time = time.perf_counter()
await wait_for_path(get_step_path(get_broadcast_dir(self.config.output_dir), next_ckpt_step) / "STABLE")
```
Scheduler filesystem probe broken for NIXL+MX mode
Medium Severity
_compute_next_ckpt_step always calls get_latest_ckpt_step on the filesystem broadcast directory, but NIXL+MX never writes checkpoint files there. This means latest_ckpt_step is always 0, forcing the scheduler into effectively strict-async behavior regardless of the strict_async_level setting. In non-strict mode with filesystem/NCCL, the orchestrator can jump ahead to the latest available checkpoint; this optimization is silently lost with NIXL+MX.
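A guarded version of the probe could, in spirit, look like the sketch below: in `nixl_mx` mode the latest step is taken from the MX status handshake instead of the filesystem. All names here (`broadcast_backend`, `latest_ready_step`, the injected `get_latest_ckpt_step`) are illustrative stand-ins, not the repo's actual identifiers.

```python
def compute_next_ckpt_step(config, rendezvous, get_latest_ckpt_step,
                           current_step, interval):
    """Sketch: pick the next checkpoint step the scheduler should target."""
    if config.broadcast_backend == "nixl_mx":
        # NIXL+MX never writes files to the broadcast dir, so ask MX
        # which step has completed instead of probing the filesystem.
        latest = rendezvous.latest_ready_step()
    else:
        # Filesystem/NCCL modes can probe the broadcast directory.
        latest = get_latest_ckpt_step(config.broadcast_dir)
    # Non-strict mode may jump ahead to the newest available checkpoint.
    return max(current_step + interval, latest)
```

With this shape the "jump ahead" optimization described above is preserved in NIXL+MX mode rather than silently degrading to strict-async behavior.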
Reviewed by Cursor Bugbot for commit 75efd8c.
```python
self.rendezvous.set_status(p2p_pb2.SOURCE_STATUS_READY)
self.logger.debug(f"NIXL+MX push completed in {time.perf_counter() - start:.2f}s")


dist.barrier()
```
Non-primary HSDP ranks skip lazy_init but hit barrier deadlock risk
Medium Severity
In broadcast_weights, when dp_replicate > 1, the master rank (rank 0) calls self.rendezvous.wait_for_all_peers_ready which blocks synchronously waiting for the orchestrator, while all other ranks (including non-primary HSDP ranks) proceed directly to dist.barrier(). If the orchestrator is slow to signal, rank 0 blocks before the barrier while other ranks are already waiting at the barrier. This is fine structurally, but wait_for_all_peers_ready is a synchronous blocking call that holds the GIL and busy-polls with time.sleep(0.05), which could delay the barrier on large clusters.
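One way to address the busy-poll concern is to block on a `threading.Event` instead of a `time.sleep(0.05)` loop, so the waiting rank parks until the MX callback signals readiness. This is a sketch assuming the rendezvous can be notified from a callback thread; the class and method names are illustrative, not the actual implementation.

```python
import threading


class Rendezvous:
    """Sketch of an event-driven wait replacing a sleep-based polling loop."""

    def __init__(self):
        self._peers_ready = threading.Event()

    def notify_peers_ready(self):
        # Called from the MX callback/status thread when all peers report in.
        self._peers_ready.set()

    def wait_for_all_peers_ready(self, timeout):
        # Event.wait blocks without repeatedly waking the interpreter,
        # unlike a sleep(0.05) polling loop.
        if not self._peers_ready.wait(timeout=timeout):
            raise TimeoutError("peers not ready within timeout")
```

Rank 0 would then reach `dist.barrier()` as soon as the orchestrator signals, minimizing the window in which other ranks sit at the barrier.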
Reviewed by Cursor Bugbot for commit 75efd8c.


Consolidates all the existing PRs for NIXL + MX into a clean implementation where I actually watched my agents.
Note
High Risk
Introduces a new RDMA-based weight transfer backend and changes orchestrator/trainer/inference coordination logic, which can affect distributed training stability and correctness. Also adds new runtime dependencies (Model Express, protobuf) and Docker-launched services that may impact deployment and CI.
Overview
- Enables a new `nixl_mx` weight broadcast option that pushes weights trainer→inference via NIXL RDMA while using Model Express (MX) for rendezvous/metadata, wiring this through configs, orchestrator init, and inference worker extensions.
- Implements the NIXL+MX transport stack: a new vLLM worker (`NIXLMxWeightUpdateWorker`) and `/init_nixl_mx` endpoint, a trainer-side `NIXLMxWeightBroadcast` with slot-based buffer allocation + conversion (bf16 passthrough / FP8 blockwise), and a `TransportPlan` that posts RDMA WRITEs based on MX-published tensor descriptors and expert maps.
- Updates SLURM templates to optionally launch the MX + Redis stack via `docker compose`, sets UCX/LD paths for RDMA, and adjusts the orchestrator scheduling/weight-update flow to use MX status handshakes instead of filesystem markers when in `nixl_mx` mode.
- Adds unit/integration tests for MX rendezvous and conversion/slot planning, and adds `modelexpress`/`protobuf` dependencies plus a tilelang CUDA runtime preload shim.

Reviewed by Cursor Bugbot for commit dabaa19. Bugbot is set up for automated code reviews on this repo.