refactor: CLI app and launching by akoumpa · Pull Request #1406 · NVIDIA-NeMo/Automodel

akoumpa · 2026-02-27T14:49:07Z

What does this PR do ?

Previously, a job could be launched in three different ways:

using a script (recipe shim) under examples/<domain>_<task>/<task>.py -c config.yaml
using the recipe directly e.g., nemo_automodel/recipes/<domain>_<task>/<task>.py -c config.yaml
using the cli application e.g., automodel llm finetune -c config.yaml

The first two give direct access to the script being run, but the commands are still too long to type. Similarly, the automodel CLI needed the "llm finetune" part to specify the recipe.

This PR refactors job launching so that config.yaml now lists the recipe to use. As a result, we can run automodel config.yaml, which is much easier to type than before. In addition, the Python scripts under examples/* are now deprecated; users should use the automodel/am CLI application instead. Release 26.04 will still contain the scripts with a deprecation message, and they will be removed in 26.06.
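To illustrate the idea of a config that names its own recipe, here is a minimal sketch. It assumes the parsed config exposes a top-level `recipe` key holding a dotted import path; the key's exact format and the helper name `resolve_recipe` are assumptions for illustration, not taken from this PR. A standard-library class stands in for a real recipe so the sketch runs anywhere.

```python
import importlib
from typing import Any, Mapping


def resolve_recipe(config: Mapping[str, Any]) -> Any:
    """Return the object named by the (assumed) top-level `recipe` key."""
    target = config.get("recipe")
    if not isinstance(target, str) or "." not in target:
        raise ValueError("config must define `recipe` as a dotted path, e.g. 'pkg.module.Name'")
    # Split "pkg.module.Name" into the module path and the attribute name.
    module_path, _, attr = target.rpartition(".")
    module = importlib.import_module(module_path)
    try:
        return getattr(module, attr)
    except AttributeError as exc:
        raise ValueError(f"{module_path!r} has no attribute {attr!r}") from exc


# Stand-in for a parsed config.yaml (hypothetical contents).
config = {"recipe": "collections.OrderedDict"}
recipe_cls = resolve_recipe(config)
print(recipe_cls.__name__)
```

With the recipe named in the config, the command line only needs the config path, which is what makes the shorter invocation possible.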

Example launch commands

| Launch command | What happens |
| --- | --- |
| `python3 app.py config.yaml --nproc-per-node 8` | LOCAL_RANK not set → detects 8 GPUs → launches torchrun internally → workers run the recipe script directly |
| `python3 app.py config.yaml` (single GPU) | LOCAL_RANK not set → detects 1 GPU → runs recipe in-process |
| `torchrun --nproc-per-node 8 app.py config.yaml` | torchrun spawns 8 workers → each enters app.py → LOCAL_RANK is set → runs recipe in-process (no nested torchrun) |

Launching app.py with python3 is the recommended command; however, launching with torchrun is also supported to avoid a bad user experience.
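The decision described in the table above can be sketched roughly as follows. This is an illustrative sketch, not the PR's implementation: the function names, the `visible_gpus` argument, and the way the worker count is chosen are assumptions. Only the LOCAL_RANK check, the GPU-count branch, and the torchrun `--nproc-per-node` flag come from the description.

```python
from typing import List, Mapping, Optional


def choose_launch_mode(
    env: Mapping[str, str],
    visible_gpus: int,
    nproc_per_node: Optional[int] = None,
) -> str:
    """Return 'in_process' or 'torchrun' for the given environment."""
    if "LOCAL_RANK" in env:
        # Already inside a torchrun worker: run the recipe here, never nest torchrun.
        return "in_process"
    # An explicit --nproc-per-node value wins; otherwise use the detected GPU count.
    workers = nproc_per_node if nproc_per_node is not None else visible_gpus
    return "torchrun" if workers > 1 else "in_process"


def build_torchrun_command(nproc: int, script: str, config_path: str) -> List[str]:
    """Build the argv for an internal torchrun launch of a recipe script."""
    return ["torchrun", f"--nproc-per-node={nproc}", script, config_path]


# The three rows of the table, in order.
print(choose_launch_mode({}, visible_gpus=8, nproc_per_node=8))
print(choose_launch_mode({}, visible_gpus=1))
print(choose_launch_mode({"LOCAL_RANK": "0"}, visible_gpus=8))
```

Checking LOCAL_RANK first is what keeps a `torchrun ... app.py` launch from spawning a second, nested torchrun.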

Changelog

Add specific line by line info of high level changes in this PR.

Before your PR is "Ready for review"

Pre checks:

Make sure you read and followed Contributor guidelines
Did you write any new necessary tests?
Did you add or update any necessary documentation?

If you haven't finished some of the above items you can still open "Draft" PR.

Additional Information

Related to # (issue)

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

copy-pr-bot · 2026-02-27T14:49:11Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

Root cause: nemo-run pins cryptography<43.0.0, but your project constrains cryptography>=46.0.5 (CVE fix). These are fundamentally incompatible, so uv lock can't resolve any extra that depends on nemo_run.

Fix:

Removed nemo_run from the cli extra -- it's now just ["pyyaml"]
Removed the standalone nemo-run extra entirely (it was also unresolvable)
Updated BREAKING_CHANGES.md, installation guide, and cluster guide to note that nemo-run should be installed separately (pip install nemo-run) if needed

The SLURM and k8s launchers don't need nemo-run at all (they shell out to sbatch/kubectl), and the NemoRunLauncher already does a runtime import check, so users who need it just install nemo-run directly.
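A runtime import check for an optional dependency, as mentioned above, might look roughly like this. It is a generic sketch using the standard library; the helper name `require_optional` and the error wording are hypothetical and not taken from NemoRunLauncher.

```python
import importlib.util


def require_optional(module_name: str, install_hint: str) -> None:
    """Raise ImportError with an install hint if a top-level module is missing."""
    # find_spec returns None when a top-level module cannot be located.
    if importlib.util.find_spec(module_name) is None:
        raise ImportError(
            f"{module_name} is required for this launcher but is not installed. "
            f"Install it with: {install_hint}"
        )


# A launcher would call this just before its first use of the package.
try:
    require_optional("nemo_run", "pip install nemo-run")
    print("nemo_run is available")
except ImportError as exc:
    print(exc)
```

Deferring the check to launch time lets the package stay out of the declared extras while still giving users a clear message.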

akoumpa · 2026-02-27T14:57:50Z

/ok to test 203a148

akoumpa · 2026-02-27T15:12:58Z

/ok to test 150e717

akoumpa · 2026-03-30T02:25:49Z

/ok to test fc5ea89

akoumpa · 2026-03-30T02:30:57Z

/ok to test 00ca581

akoumpa · 2026-03-30T04:02:08Z

/ok to test 550c136

akoumpa · 2026-03-30T06:10:41Z

/ok to test d208597

adil-a · 2026-03-30T20:05:18Z

/claude review

akoumpa · 2026-03-30T20:15:21Z

/ok to test c507d8e

akoumpa · 2026-03-30T21:40:05Z

/ok to test 5c6ccaa

akoumpa · 2026-03-30T22:24:23Z

/ok to test 164e4f1

akoumpa · 2026-03-30T22:52:03Z

/claude review

akoumpa · 2026-03-31T22:13:46Z

/ok to test 84ae1f8

adil-a

LGTM! Thanks a lot

akoumpa · 2026-04-01T05:55:19Z

/ok to test 85da4ee

akoumpa · 2026-04-01T07:21:14Z

already approved #1406, only change was in 85da4ee (minor) -> FM.

akoumpa added 7 commits February 27, 2026 06:45

refactor CLI app

150edc6

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

update tests

0be168d

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

add breakin changes log

bba966b

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

add deprecation messages

289e1b6

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

update recipes

1b66802

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

update readme

d045c60

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

update docs

fcba611

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

akoumpa added 2 commits February 27, 2026 06:54

add launch

0f93f7b

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

akoumpa force-pushed the akoumparouli/refactor_app_and_launching branch from d5cd04a to 944138c Compare February 27, 2026 14:57

copy-pr-bot bot temporarily deployed to test February 27, 2026 14:58 Inactive

copy-pr-bot bot temporarily deployed to nemo-ci February 27, 2026 14:58 Inactive

copy-pr-bot bot had a problem deploying to nemo-ci February 27, 2026 15:05 Error

update examples and messages

ffdadc3

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

akoumpa force-pushed the akoumparouli/refactor_app_and_launching branch from 203a148 to ffdadc3 Compare February 27, 2026 15:09

akoumpa and others added 2 commits February 27, 2026 15:09

Update uv lock

9524b9f

Signed-off-by: akoumpa <akoumpa@users.noreply.github.com>

mention uv run am in docs

150e717

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

copy-pr-bot bot temporarily deployed to nemo-ci February 27, 2026 15:13 Inactive

fix

00ca581

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

deprecate k8s in favor of skypilot

550c136

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

fix test

d208597

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

claude bot reviewed Mar 30, 2026

Comment thread nemo_automodel/components/launcher/slurm/launcher.py Outdated

claude bot reviewed Mar 30, 2026

Comment thread examples/llm_finetune/finetune.py Outdated

claude bot reviewed Mar 30, 2026

Comment thread nemo_automodel/components/launcher/interactive.py Outdated

akoumpa and others added 3 commits March 30, 2026 13:13

Update nemo_automodel/components/launcher/slurm/launcher.py

6c4d6c3

Co-authored-by: claude[bot] <209825114+claude[bot]@users.noreply.github.com>

Update nemo_automodel/components/launcher/interactive.py

56018cb

Co-authored-by: claude[bot] <209825114+claude[bot]@users.noreply.github.com>

Update examples/llm_finetune/finetune.py

c507d8e

Co-authored-by: claude[bot] <209825114+claude[bot]@users.noreply.github.com>

akoumpa added 3 commits March 30, 2026 14:07

keep only slurm.sub and remove launcher

b9539f7

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

Merge branch 'main' into akoumparouli/refactor_app_and_launching

b94344c

add recipe field

5c6ccaa

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

exclude

164e4f1

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

claude bot reviewed Mar 30, 2026

Comment thread nemo_automodel/components/launcher/interactive.py

akoumpa and others added 2 commits March 31, 2026 15:09

consistency is key

5231e0f

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

Merge branch 'main' into akoumparouli/refactor_app_and_launching

84ae1f8

adil-a previously approved these changes Apr 1, 2026

adding missing recipe

85da4ee

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>
