feat(k8s): minimal kubectl launcher mirroring slurm by samsja · Pull Request #2419 · PrimeIntellect-ai/prime-rl

samsja · 2026-05-05T03:35:01Z

Summary

Adds a k8s launcher path that mirrors the existing slurm flow: same RLConfig, new [k8s] block, single Jinja template, one kubectl apply to submit.

The point of this PR is to share the design with the team for early feedback, so the scope is intentionally small.

How it works

uv run rl @ examples/reverse_text/rl.toml @ examples/reverse_text/k8s_rl.toml

1. rl_k8s() writes trainer.toml / orchestrator.toml / inference.toml to ./k8s-runs/<job>/configs/ (analog of write_subconfigs in slurm).
2. Renders templates/rl.k8s.yaml.j2 with those TOMLs inlined as a ConfigMap (data.trainer.toml: | block scalar).
3. Manifest contains: ConfigMap, PVC, three headless Services, three StatefulSets (trainer / inference / orchestrator). Pods mount the ConfigMap at /etc/prime-rl/configs/ and the PVC at /data.
4. kubectl apply -f <manifest>.
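The ConfigMap inlining and the final submit can be sketched with the standard library alone. This is a minimal illustration, not the PR's template: `render_configmap` and `kubectl_apply_argv` are hypothetical helpers, the ConfigMap name `<job>-configs` and the manifest path are assumptions, and plain string building stands in for Jinja. It shows how a TOML string ends up as a `|` block scalar under `data.trainer.toml`.

```python
import textwrap


def render_configmap(job: str, tomls: dict[str, str]) -> str:
    """Inline each TOML file as a YAML block scalar in a ConfigMap.

    Hypothetical stand-in for the Jinja template; the "<job>-configs"
    name is an assumption, not taken from the PR.
    """
    lines = [
        "apiVersion: v1",
        "kind: ConfigMap",
        "metadata:",
        f"  name: {job}-configs",
        "data:",
    ]
    for filename, body in tomls.items():
        lines.append(f"  {filename}: |")
        # Block scalar content must be indented deeper than its key.
        lines.append(textwrap.indent(body.rstrip("\n"), "    "))
    return "\n".join(lines) + "\n"


def kubectl_apply_argv(manifest_path: str) -> list[str]:
    """Build the submit command without running it."""
    return ["kubectl", "apply", "-f", manifest_path]


manifest = render_configmap(
    "demo", {"trainer.toml": 'max_steps = 10\n[model]\nname = "x"\n'}
)
print(manifest)
# Illustrative path only; the real manifest location is not stated in the PR.
print(kubectl_apply_argv("manifest.yaml"))
```

Keeping the command as an argv list means it can be passed to `subprocess.run` directly, with no shell quoting involved.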

Trainer rendezvous via <job>-trainer-0.<job>-trainer-headless:29500. Orchestrator gets INFER_URLS env from the headless inference service DNS pattern.
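The DNS names follow from how StatefulSet pods are addressed behind a headless Service (`<pod>.<service>` within the namespace). A rough sketch of both addresses is below. Only the trainer pattern and port 29500 come from this PR; the inference service name `<job>-inference-headless`, port 8000, the `/v1` suffix, and the comma-separated format of INFER_URLS are assumptions, and both function names are hypothetical.

```python
def trainer_rendezvous(job: str, port: int = 29500) -> str:
    """Address of trainer pod 0 behind the trainer headless Service."""
    return f"{job}-trainer-0.{job}-trainer-headless:{port}"


def infer_urls(job: str, replicas: int, port: int = 8000) -> str:
    """Comma-separated per-pod URLs for the inference StatefulSet.

    Assumptions (not from the PR): the Service is named
    "<job>-inference-headless", the server listens on port 8000 under
    /v1, and INFER_URLS is comma-separated.
    """
    hosts = (
        f"{job}-inference-{i}.{job}-inference-headless" for i in range(replicas)
    )
    return ",".join(f"http://{host}:{port}/v1" for host in hosts)


print(trainer_rendezvous("demo"))
print(infer_urls("demo", replicas=2))
```

Because StatefulSet pod names carry a stable ordinal, the full URL list can be computed at render time, before any pod exists.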

File map (5 changed + 1 gitignore)

| file | change |
| --- | --- |
| `src/prime_rl/configs/shared.py` | new `K8sConfig` |
| `src/prime_rl/configs/rl.py` | `k8s` field, mutex with slurm, template auto-load, allow multi-node + k8s |
| `src/prime_rl/entrypoints/rl.py` | `rl_k8s()`, `write_k8s_manifest()`, dispatch in `rl()` |
| `src/prime_rl/templates/rl.k8s.yaml.j2` | new template |
| `examples/reverse_text/k8s_rl.toml` | example user-facing config |
| `.gitignore` | ignore `k8s-runs/` |
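The config-side changes (a `k8s` field that is mutually exclusive with slurm, plus dispatch in `rl()`) can be pictured roughly as follows. This is a sketch under stated assumptions, not the PR's code: it uses stdlib dataclasses rather than whatever config framework the repo uses, every field shown is hypothetical, and `rl_slurm` is a placeholder name for the existing slurm path.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SlurmConfig:
    job_name: str = "rl"  # hypothetical field


@dataclass
class K8sConfig:
    job_name: str = "rl"  # hypothetical field
    namespace: str = "default"  # hypothetical field


@dataclass
class RLConfig:
    slurm: Optional[SlurmConfig] = None
    k8s: Optional[K8sConfig] = None

    def __post_init__(self) -> None:
        # The two launchers are mutually exclusive.
        if self.slurm is not None and self.k8s is not None:
            raise ValueError("set either [slurm] or [k8s], not both")


def pick_launcher(cfg: RLConfig) -> str:
    """Return the name of the launcher that rl() would dispatch to."""
    if cfg.k8s is not None:
        return "rl_k8s"
    if cfg.slurm is not None:
        return "rl_slurm"  # placeholder name, not from the PR
    return "rl_local"


print(pick_launcher(RLConfig(k8s=K8sConfig())))
print(pick_launcher(RLConfig()))
```

Rejecting the conflicting combination at construction time keeps the dispatch function a plain chain of checks.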

Scope cuts (call out for review)

- Multi-node only. Single-node still uses rl_local.
- One pod per inference replica. No vllm-router fanout, no disaggregated PD. That's where the slurm template grows complexity; deferring until the basic flow lands.
- No NCCL/IB tuning. Slurm template auto-detects via ibv_devinfo; on k8s we rely on the GPU operator. May need an env block later.
- Existing k8s/ Helm chart left untouched as the legacy hand-driven path. Plan is to delete once this lands and disagg/router are ported.
- No live cluster test. Validated locally via yaml.safe_load_all (8 docs render, ConfigMap contains rendered TOMLs, trainer args reference /etc/prime-rl/configs/trainer.toml, INFER_URLS resolves correctly). kubectl --dry-run=client would catch schema issues but isn't installed in my env.
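The "8 docs" figure matches the manifest contents listed earlier: 1 ConfigMap + 1 PVC + 3 Services + 3 StatefulSets. A rough, dependency-free version of that count check is sketched below. It is a crude stand-in for the `yaml.safe_load_all` validation mentioned above, not the check actually run: `count_kinds` is a hypothetical helper that only looks at top-level `kind:` lines, and the sample manifest is a toy.

```python
import re
from collections import Counter

# Expected document counts, taken from the manifest description in this PR.
EXPECTED_KINDS = {
    "ConfigMap": 1,
    "PersistentVolumeClaim": 1,
    "Service": 3,
    "StatefulSet": 3,
}


def count_kinds(manifest: str) -> Counter:
    """Count top-level `kind:` values per `---`-separated document."""
    kinds = Counter()
    for doc in re.split(r"(?m)^---\s*$", manifest):
        match = re.search(r"(?m)^kind:\s*(\S+)", doc)
        if match:
            kinds[match.group(1)] += 1
    return kinds


# Toy manifest: one stub document per expected object.
sample = "\n---\n".join(
    f"kind: {kind}\nmetadata:\n  name: demo-{i}"
    for kind, n in EXPECTED_KINDS.items()
    for i in range(n)
)
kinds = count_kinds(sample)
print(sum(kinds.values()), dict(kinds))
```

Only unindented `---` and `kind:` lines are matched, so TOML text inlined in the ConfigMap as an indented block scalar is not miscounted. A real YAML parser is still the better check, since this sketch does not validate syntax at all.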

Adds K8sConfig (sibling to SlurmConfig) and rl_k8s dispatch in the RL entrypoint. The launcher writes per-component TOMLs, renders a single multi-doc YAML manifest via Jinja that embeds the TOMLs in a ConfigMap, and `kubectl apply -f`s it. Manifest contains one StatefulSet per role (trainer / inference / orchestrator), headless services for stable DNS, and a shared PVC for outputs.

Scope is intentionally small for review: multi-node only, one pod per inference replica (no router fanout), no disagg PD. Existing Helm chart under k8s/ is untouched as the legacy hand-driven path.

Usage: uv run rl @ examples/reverse_text/rl.toml @ examples/reverse_text/k8s_rl.toml

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

samsja wants to merge 1 commit into main from feat/k8s-launcher
