feature: prime train routes full_finetune TOMLs to hosted endpoint#592

Open
JannikSt wants to merge 9 commits into main from feature/train-hosted-full-finetune
Conversation


@JannikSt (Member) commented May 2, 2026

Note: Medium Risk
Changes how prime train dispatches and deletes runs by introducing a second backend endpoint and typed 404 handling; mis-detection or error mapping could route jobs to the wrong API or change delete behavior.

Overview
prime train now detects full fine-tune TOMLs (type = "full_finetune" or a [deployment] GPU block) and dispatches them to a new hosted endpoint (POST /v1/training/runs) instead of the existing LoRA/shared RL endpoint.

Adds a new HostedTrainingClient + payload builder for full-FT, wires secrets from env/env-files, and intentionally suppresses the returned per-run token from CLI output.

Run deletion now tries the hosted full-FT delete endpoint first and falls back to the existing RL delete path on a typed NotFoundError; the core HTTP client now raises NotFoundError for 404s, and RLRun gains an optional kind discriminator for forward/backward compatibility.

Reviewed by Cursor Bugbot for commit 28fdcd2.

JannikSt added 3 commits May 1, 2026 11:28
…o hosted endpoint

`prime train <toml>` stays the single entry point. When the TOML carries
`type = "full_finetune"` (or a `[hosted]` block, or a `[deployment]`
block matching prime-rl's qwen30b_math/rl.toml shape), the CLI routes
to the new public API at /api/v1/training/runs instead of the LoRA
shared-cluster path. Backwards compatible: configs without these
markers run through the existing flow unchanged.

* api/training.py: new HostedTrainingClient + build_payload_from_toml
  (whitelist-maps prime-rl example fields onto the API payload).
* api/rl.py: surface `kind` on RLRun so `prime train delete` can route
  to the right endpoint based on run kind.
* commands/rl.py: peek the TOML before strict RLConfig parse; on full-FT
  hand off to _dispatch_full_finetune_run with shared env-file/secrets
  plumbing. Delete looks up kind and dispatches via HostedTrainingClient
  for DEDICATED_FULL_FT runs.

Tested end-to-end against local backend on rft-freyr cluster: dispatch
+ status mirroring + completion + delete all clean.

Drop the friction of looking up a cluster cuid for the common single-
cluster setup. Backend now auto-picks the first uncordoned PrimeCluster
when the field is omitted, so `prime train backend/examples/training/reverse_text.toml`
is zero-config.
…eyRef

The platform materialises the per-run RFT API token into the per-run
k8s Secret on dispatch and the chart binds it to the orchestrator pod's
PRIME_API_KEY env var. The token already lives where prime-monitor
needs it — surfacing it on stdout just makes it easy to leak into
shared shell history or CI logs.

@chatgpt-codex-connector (bot) left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 7b88b5f7d0


Comment thread packages/prime/src/prime_cli/commands/rl.py Outdated
Comment thread packages/prime/src/prime_cli/commands/rl.py Outdated
Comment thread packages/prime/src/prime_cli/commands/rl.py Outdated
JannikSt added 3 commits May 2, 2026 16:45
- Drop prime_cluster_id from CLI plumbing entirely. Backend auto-picks
  the first uncordoned PrimeCluster; the CLI never threads a cluster
  id, removing a footgun (mistargeting a config to the wrong cluster).
  Drops the [hosted] discriminator path too — type = 'full_finetune'
  or a [deployment] block remain the only triggers.

- codex P1 / cursor: --output json on the full-FT path no longer
  short-circuits with a {would_dispatch} preview. Mirrors the LoRA
  path's 'create then format' contract — automation that pipes the
  JSON to grab run_id now actually dispatches the run.

- codex P2: env_file (deprecated, singular) is loaded BEFORE env_files
  (canonical, plural) so env_files wins on key collision. Matches the
  LoRA path's documented precedence.

build_payload_from_toml used to whitelist ~12 individual fields; the
backend rebuilt a minimal TOML from them, dropping anything outside
the whitelist (custom optim schedules, eval configs, custom scheduler
params, …). E2E prime-rl runs that depended on those knobs silently
diverged from `uv run rl @ rl.toml` behaviour.

Now: ship the whole TOML as 'config' (companion to platform PR #1824
faa934d56). The backend's build_values takes the config dict directly
so the same TOML works locally with prime-rl and remote-dispatched
through prime-cli with no fork in behaviour.

Unrelated to the full-FT training payload change; accidentally picked
up by 'git add -A'. Restoring inference.py to its 4b9be8f state. The
streaming improvements will land in a sibling PR off main.
Comment thread packages/prime/src/prime_cli/commands/rl.py Outdated
Cursor caught: when rl_client.get_run fails (e.g. pydantic
ValidationError because a DEDICATED_FULL_FT row doesn't carry the
LoRA-required RLRun fields), the prior except APIError block silently
set kind=None and routed delete to the LoRA endpoint. The hosted
helm release + namespace would stay live with no signal back.

Restructure: try hosted_client.delete_run first; on HTTP 404 the
backend's kind gate told us 'not a DEDICATED_FULL_FT', so we fall
through to the LoRA path. Removes the get_run discriminator
roundtrip entirely — and any pydantic surprise it could have raised.
Comment thread packages/prime/src/prime_cli/commands/rl.py Outdated
Cursor flagged: distinguishing 404 via `'HTTP 404' not in str(e)` is
fragile — it depends on the message format never changing AND on no
unrelated error body coincidentally containing the substring.

- Add NotFoundError(APIError) subclass; APIClient raises it for 404
  responses (sync + async paths).
- HostedTrainingClient.create_run / delete_run no longer catch
  Exception and rewrap into a generic APIError — typed APIError
  subclasses propagate so callers can branch by class.
- prime train delete uses except NotFoundError as the 'not a hosted
  run, try LoRA' fallback signal.
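The typed-404 plumbing and the delete fallback can be sketched together (a minimal illustration; the class names mirror the PR's description, everything else here is assumed):

```python
class APIError(Exception):
    """Base error raised by the CLI's HTTP client (sketch)."""


class NotFoundError(APIError):
    """Typed 404 so callers branch by exception class, not by
    searching the message for the 'HTTP 404' substring."""


def raise_for_status(status_code: int, body: str = "") -> None:
    """Sketch of the client's error mapping (shared by sync and async
    paths): 404 becomes NotFoundError, other 4xx/5xx stay APIError."""
    if status_code == 404:
        raise NotFoundError(body)
    if status_code >= 400:
        raise APIError(body)


def delete_training_run(run_id: str, hosted_client, rl_client) -> str:
    """`prime train delete` routing per this commit: try the hosted
    full-FT endpoint first; NotFoundError means 'not a hosted run',
    so fall through to the existing LoRA/RL delete path."""
    try:
        hosted_client.delete_run(run_id)
        return "hosted"
    except NotFoundError:
        rl_client.delete_run(run_id)
        return "rl"
```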

@cursor (bot) left a comment


Cursor Bugbot has reviewed your changes and found 1 potential issue.


Reviewed by Cursor Bugbot for commit 2e1688d.

Comment thread packages/prime/src/prime_cli/commands/rl.py
Cursor caught: `if output != "json" and not yes` meant a researcher
piping JSON for scripting would have full-FT runs auto-launched
without ack. The LoRA path always prompts unless --yes regardless of
output format. Match that contract — gate confirmation purely on
--yes.
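The fix reduces to gating the prompt purely on the flag (a toy sketch; the surrounding CLI wiring is omitted):

```python
def should_prompt(yes: bool) -> bool:
    """Post-fix contract: confirm unless --yes, regardless of --output."""
    return not yes


def should_prompt_buggy(yes: bool, output: str) -> bool:
    """Pre-fix behaviour flagged by Cursor: --output json silently
    skipped the confirmation, auto-launching the run."""
    return output != "json" and not yes
```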
