feat(k8s): minimal kubectl launcher mirroring slurm#2419

Draft
samsja wants to merge 1 commit into main from feat/k8s-launcher

samsja commented May 5, 2026

Summary

Adds a k8s launcher path that mirrors the existing slurm flow: same RLConfig, new [k8s] block, single Jinja template, one kubectl apply to submit.

The point of this PR is to share the design with the team for early feedback — scope is intentionally small.

How it works

uv run rl @ examples/reverse_text/rl.toml @ examples/reverse_text/k8s_rl.toml
  1. rl_k8s() writes trainer.toml / orchestrator.toml / inference.toml to ./k8s-runs/<job>/configs/ (analog of write_subconfigs in slurm).
  2. Renders templates/rl.k8s.yaml.j2 with those TOMLs inlined as a ConfigMap (data.trainer.toml: | block scalar).
  3. Manifest contains: ConfigMap, PVC, three headless Services, three StatefulSets (trainer / inference / orchestrator). Pods mount the ConfigMap at /etc/prime-rl/configs/ and the PVC at /data.
  4. kubectl apply -f <manifest>.
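The steps above can be sketched roughly as follows. This is a stdlib-only illustration, not the PR's code: it builds the ConfigMap with plain string formatting where the PR uses a Jinja template, and `render_configmap`, `write_manifest`, `kubectl_apply_cmd`, the `<job>-configs` name, and the `manifest.yaml` filename are all hypothetical.

```python
import textwrap
from pathlib import Path


def render_configmap(job: str, tomls: dict[str, str]) -> str:
    """Render a ConfigMap whose data keys hold TOML text as YAML block scalars."""
    lines = [
        "apiVersion: v1",
        "kind: ConfigMap",
        "metadata:",
        f"  name: {job}-configs",  # hypothetical naming scheme
        "data:",
    ]
    for filename, body in tomls.items():
        lines.append(f"  {filename}: |")
        # Block scalar content must be indented deeper than its key.
        lines.append(textwrap.indent(body.rstrip("\n"), "    "))
    return "\n".join(lines) + "\n"


def write_manifest(job: str, tomls: dict[str, str], root: Path) -> Path:
    """Write per-component TOMLs under <root>/<job>/configs/, then the manifest."""
    run_dir = root / job
    config_dir = run_dir / "configs"
    config_dir.mkdir(parents=True, exist_ok=True)
    for filename, body in tomls.items():
        (config_dir / filename).write_text(body)
    manifest = run_dir / "manifest.yaml"  # hypothetical filename
    manifest.write_text(render_configmap(job, tomls))
    return manifest


def kubectl_apply_cmd(manifest: Path) -> list[str]:
    """Build (but do not run) the submit command."""
    return ["kubectl", "apply", "-f", str(manifest)]
```

Returning the command as a list instead of running it keeps the sketch testable without a cluster; a real launcher would presumably pass it to `subprocess.run(..., check=True)`.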

The trainer rendezvous happens at <job>-trainer-0.<job>-trainer-headless:29500. The orchestrator gets its INFER_URLS env var from the DNS pattern of the headless inference service.
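To make the addressing concrete, here is a small sketch of how these names could be derived. The trainer address follows the pattern stated above. The inference service name (`<job>-inference-headless`), port 8000, the `http://` scheme, and the comma-separated format of INFER_URLS are assumptions for illustration, not taken from the PR.

```python
def trainer_rendezvous(job: str, port: int = 29500) -> str:
    """Stable DNS name of trainer pod 0 behind its headless service."""
    return f"{job}-trainer-0.{job}-trainer-headless:{port}"


def infer_urls(job: str, replicas: int, port: int = 8000) -> str:
    """One URL per inference pod, using StatefulSet <pod>.<service> DNS names."""
    service = f"{job}-inference-headless"  # assumed service name
    return ",".join(
        f"http://{job}-inference-{i}.{service}:{port}"  # assumed scheme and port
        for i in range(replicas)
    )
```

StatefulSet pods get ordinal names (`-0`, `-1`, ...) and a headless service gives each one a resolvable `<pod>.<service>` record, which is why the addresses can be computed before any pod is scheduled.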

File map (5 changed + 1 gitignore)

| file | change |
| --- | --- |
| src/prime_rl/configs/shared.py | new K8sConfig |
| src/prime_rl/configs/rl.py | k8s field, mutex with slurm, template auto-load, allow multi-node + k8s |
| src/prime_rl/entrypoints/rl.py | rl_k8s(), write_k8s_manifest(), dispatch in rl() |
| src/prime_rl/templates/rl.k8s.yaml.j2 | new template |
| examples/reverse_text/k8s_rl.toml | example user-facing config |
| .gitignore | ignore k8s-runs/ |
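The "mutex with slurm" and dispatch entries could look something like the sketch below. It uses a plain dataclass, not whatever config framework the repo actually uses, and `SchedulerChoice` and its `launcher` property are hypothetical names.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SchedulerChoice:
    """Hypothetical stand-in for the slurm/k8s fields on the RL config."""

    slurm: Optional[dict] = None
    k8s: Optional[dict] = None

    def __post_init__(self) -> None:
        # Reject configs that set both blocks at once.
        if self.slurm is not None and self.k8s is not None:
            raise ValueError("[slurm] and [k8s] are mutually exclusive; set at most one")

    @property
    def launcher(self) -> str:
        """Pick the launch path; falls back to local when neither block is set."""
        if self.k8s is not None:
            return "k8s"
        if self.slurm is not None:
            return "slurm"
        return "local"
```

Validating at construction time means a conflicting pair of TOML overlays fails before anything is written to disk or submitted.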

Scope cuts (call out for review)

  • Multi-node only. Single-node still uses rl_local.
  • One pod per inference replica. No vllm-router fanout, no disaggregated PD. That's where the slurm template grows complexity; deferring until the basic flow lands.
  • No NCCL/IB tuning. Slurm template auto-detects via ibv_devinfo; on k8s we rely on the GPU operator. May need an env block later.
  • Existing k8s/ Helm chart left untouched as the legacy hand-driven path. Plan is to delete once this lands and disagg/router are ported.
  • No live cluster test. Validated locally via yaml.safe_load_all (8 docs render, ConfigMap contains rendered TOMLs, trainer args reference /etc/prime-rl/configs/trainer.toml, INFER_URLS resolves correctly). kubectl --dry-run=client would catch schema issues but isn't installed in my env.
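The local validation described in the last bullet could be approximated without any dependencies as below. This is a rough stand-in for the `yaml.safe_load_all` check, not a YAML parser: it only splits on `---` separators and reads top-level `kind:` lines. `count_kinds` and `EXPECTED` are hypothetical names, and the expected counts simply restate the manifest contents listed earlier (8 documents).

```python
import re
from collections import Counter

# ConfigMap + PVC + three headless Services + three StatefulSets = 8 documents.
EXPECTED = Counter(
    {"ConfigMap": 1, "PersistentVolumeClaim": 1, "Service": 3, "StatefulSet": 3}
)


def count_kinds(manifest_text: str) -> Counter:
    """Count top-level `kind:` values across the documents of a multi-doc manifest."""
    docs = re.split(r"^---[ \t]*$", manifest_text, flags=re.MULTILINE)
    kinds: Counter = Counter()
    for doc in docs:
        # Only match `kind:` at column 0, so nested or inlined content is ignored.
        match = re.search(r"^kind:\s*(\S+)", doc, flags=re.MULTILINE)
        if match:
            kinds[match.group(1)] += 1
    return kinds
```

A check like `count_kinds(text) == EXPECTED` catches a dropped or duplicated document, but not schema errors; for those, a real parser or `kubectl --dry-run=client` is still needed.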

Adds K8sConfig (sibling to SlurmConfig) and rl_k8s dispatch in the RL
entrypoint. The launcher writes per-component TOMLs, renders a single
multi-doc YAML manifest via Jinja that embeds the TOMLs in a ConfigMap,
and `kubectl apply -f`s it. Manifest contains one StatefulSet per role
(trainer / inference / orchestrator), headless services for stable DNS,
and a shared PVC for outputs.

Scope is intentionally small for review: multi-node only, one pod per
inference replica (no router fanout), no disagg PD. Existing Helm chart
under k8s/ is untouched as the legacy hand-driven path.

Usage:
  uv run rl @ examples/reverse_text/rl.toml @ examples/reverse_text/k8s_rl.toml

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>