Add embedding recipe: build domain-specific embeddings from raw documents #85

Open
oliverholworthy wants to merge 10 commits into main from oholworthy/embed-recipe

Conversation

@oliverholworthy oliverholworthy commented Mar 10, 2026

Summary

  • Adds a complete embedding recipe with 6 stages: synthetic data generation (SDG), data prep, fine-tuning, evaluation, export, and deployment
  • Includes CLI commands under nemotron embed (sdg, prep, finetune, eval, export, deploy, run)
  • Adds docker executor support to nemo_runspec for local-docker execution
  • Adds pydantic-based config loading and config model introspection for --help
  • Includes sample data, tests, and comprehensive README documentation
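
The config model introspection for --help mentioned above can be pictured roughly as follows. This is a minimal sketch, not the PR's code: it uses the standard-library dataclasses module as a dependency-free stand-in for pydantic, and FinetuneConfigSketch, its fields, and help_lines are hypothetical names chosen only for illustration.

```python
from dataclasses import MISSING, dataclass, field, fields


@dataclass
class FinetuneConfigSketch:
    """Hypothetical config; the recipe's real fields are not shown on this page."""

    train_data_path: str = field(metadata={"help": "Path to prepared training data"})
    learning_rate: float = field(default=1e-5, metadata={"help": "Optimizer learning rate"})
    num_epochs: int = field(default=1, metadata={"help": "Number of training epochs"})


def help_lines(config_cls) -> list:
    """Build one help line per config field by inspecting the class definition."""
    lines = []
    for f in fields(config_cls):
        # A field with neither a default nor a default factory must be supplied.
        if f.default is MISSING and f.default_factory is MISSING:
            default = "required"
        else:
            default = f"default: {f.default!r}"
        type_name = getattr(f.type, "__name__", str(f.type))
        help_text = f.metadata.get("help", "")
        lines.append(f"--{f.name} ({type_name}, {default}): {help_text}")
    return lines


if __name__ == "__main__":
    print("\n".join(help_lines(FinetuneConfigSketch)))
```

The point of the pattern is that option names, types, defaults, and descriptions live in one place, so the help text cannot drift from the config definition.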

Test plan

  • nemotron embed finetune --run local-docker launches and streams logs
  • nemotron embed finetune --dry-run shows config without executing
  • nemotron embed finetune --help displays config options from pydantic model
  • Unit tests pass: pytest tests/recipes/embed/
  • No regressions in existing nemotron nano3 commands

@oliverholworthy oliverholworthy force-pushed the oholworthy/embed-recipe branch from e90e6b4 to f0c63a4 Compare March 10, 2026 17:58
@oliverholworthy oliverholworthy self-assigned this Mar 10, 2026
@oliverholworthy oliverholworthy force-pushed the oholworthy/embed-recipe branch from f0c63a4 to 4fd435d Compare March 10, 2026 18:00
@oliverholworthy oliverholworthy changed the title Add embedding recipe for fine-tuning, evaluation, and deployment Add embedding recipe: build domain-specific embeddings from raw documents Mar 11, 2026
@bernardwin bernardwin requested a review from marcromeyn March 23, 2026 18:39
@oliverholworthy oliverholworthy force-pushed the oholworthy/embed-recipe branch from 4fd435d to 67e75af Compare March 25, 2026 20:54
Signed-off-by: Oliver Holworthy <1216955+oliverholworthy@users.noreply.github.com>
Move detailed documentation from the recipe README into
docs/nemotron/embed/ to follow the nano3/super3 pattern.
Add grid card and toctree entry in docs/index.md.

Signed-off-by: Oliver Holworthy <1216955+oliverholworthy@users.noreply.github.com>
Remove bundled sample data from the repo and download it on demand from
HuggingFace (nvidia/Retrieval-Synthetic-NVDocs-v1). The SDG stage now
supports hf:// URIs in corpus_dir config, e.g.:

  hf://nvidia/Retrieval-Synthetic-NVDocs-v1@<sha>/sample_corpus/nv_pp_random

This keeps the repo lightweight while preserving zero-config quick start
— the default config auto-downloads the sample corpus on first run.

Signed-off-by: Oliver Holworthy <1216955+oliverholworthy@users.noreply.github.com>
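
The hf:// URI form described in the commit above can be split into its parts with a few lines of parsing. The following is a minimal sketch under stated assumptions, not the SDG stage's actual implementation: it assumes the repo id is always org/name, that the optional revision after "@" contains no slashes, and that everything after it is a path inside the repo. HfUri and parse_hf_uri are hypothetical names, and "abc123" in the demo is a placeholder rather than a real commit sha.

```python
from typing import NamedTuple, Optional

HF_SCHEME = "hf://"


class HfUri(NamedTuple):
    repo_id: str             # e.g. "nvidia/Retrieval-Synthetic-NVDocs-v1"
    revision: Optional[str]  # commit sha or ref after "@", if present
    subpath: str             # path inside the repo, "" for the repo root


def parse_hf_uri(uri: str) -> HfUri:
    """Split hf://<org>/<name>[@<revision>][/<subpath>] into its components."""
    if not uri.startswith(HF_SCHEME):
        raise ValueError(f"not an hf:// URI: {uri!r}")
    parts = uri[len(HF_SCHEME):].split("/", 2)
    if len(parts) < 2 or not parts[0] or not parts[1]:
        raise ValueError(f"expected hf://<org>/<name>, got {uri!r}")
    org, name_and_rev = parts[0], parts[1]
    subpath = parts[2] if len(parts) == 3 else ""
    name, _, revision = name_and_rev.partition("@")
    if not name:
        raise ValueError(f"missing repo name in {uri!r}")
    return HfUri(f"{org}/{name}", revision or None, subpath)


if __name__ == "__main__":
    # "abc123" is a placeholder revision used only for this demo.
    demo = "hf://nvidia/Retrieval-Synthetic-NVDocs-v1@abc123/sample_corpus/nv_pp_random"
    print(parse_hf_uri(demo))
```

The parsed parts would then be handed to whatever download call the stage uses; that step is left out here because the page does not show it.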
- conftest.py: Fix cli_module paths from nemotron.cli.embed.* to
  nemotron.cli.commands.embed.* to match actual module locations
- test_config_models.py: Provide required sdg_input_path for
  DataPrepConfig tests that construct with defaults
- test_module_exports.py: Update expected exports to match actual API
  (SCRIPT_PATH, SPEC, META instead of SCRIPT_LOCAL, CONFIG_DIR, etc.)

Signed-off-by: Oliver Holworthy <1216955+oliverholworthy@users.noreply.github.com>
@oliverholworthy oliverholworthy force-pushed the oholworthy/embed-recipe branch from 9f03218 to 2e23ca7 Compare March 30, 2026 11:09
Signed-off-by: Oliver Holworthy <1216955+oliverholworthy@users.noreply.github.com>
Signed-off-by: Oliver Holworthy <1216955+oliverholworthy@users.noreply.github.com>
- finetune.py: use entrypoint='python' (was 'python3') to match all other
  embed commands and the nano3/super3 pattern
- sdg.py: remove unused for_remote variable

Signed-off-by: Oliver Holworthy <1216955+oliverholworthy@users.noreply.github.com>
@oliverholworthy oliverholworthy force-pushed the oholworthy/embed-recipe branch from be7f927 to dd66ea2 Compare March 30, 2026 13:31
@oliverholworthy oliverholworthy marked this pull request as ready for review March 30, 2026 14:08
- stage1_data_prep: use context manager for open() to avoid file handle leak
- stage4_export: remove unused `from functools import partial` inside export_to_onnx
- stage4_export: restore torch.onnx.export after patching using try/finally
- docs: update corpus_dir default in README to match actual default.yaml HF URI

Signed-off-by: Oliver Holworthy <1216955+oliverholworthy@users.noreply.github.com>
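
The try/finally restore described for stage4_export in the commit above follows a common patch-and-restore pattern. The sketch below illustrates it on a stand-in object instead of torch.onnx so that it runs without PyTorch; fake_onnx, original_export, failing_export, and export_with_patch are hypothetical names, and the real patch applied in the recipe is not shown on this page.

```python
from types import SimpleNamespace


def original_export(model, path):
    """Stand-in for the function that gets patched temporarily."""
    return f"exported {model} to {path}"


def failing_export(model, path):
    """Stand-in for an export call that raises partway through."""
    raise RuntimeError("export failed")


# Stand-in for a module-like object such as torch.onnx (assumption).
fake_onnx = SimpleNamespace(export=original_export)


def export_with_patch(module, model, path):
    """Temporarily wrap module.export, then restore it no matter what happens."""
    original = module.export

    def patched(model, path):
        # Illustrative wrapper: delegate to the original and tag the result.
        return original(model, path) + " (patched)"

    module.export = patched
    try:
        return module.export(model, path)
    finally:
        # Runs on success and on exception, so the patch never leaks out.
        module.export = original


if __name__ == "__main__":
    print(export_with_patch(fake_onnx, "model", "out.onnx"))
    print(fake_onnx.export is original_export)
```

Without the finally clause, an exception during export would leave the patched function installed for every later caller in the same process.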
- Remove uv.lock from .gitignore and track stage lock files for
  reproducible builds
- Revert AutoTokenizer monkey-patch in eval stage, no longer needed
  after upstream checkpoint fix (rope_theta added to rope_scaling)

Signed-off-by: Oliver Holworthy <1216955+oliverholworthy@users.noreply.github.com>
…cipe

- Add 'Using NVIDIA's Pre-Generated Dataset' section showing how to skip
  Stage 0 using nvidia/Retrieval-Synthetic-NVDocs-v1
- Add dataset link to Further Reading

Signed-off-by: root <1216955+oliverholworthy@users.noreply.github.com>
@oliverholworthy oliverholworthy force-pushed the oholworthy/embed-recipe branch from dd66ea2 to f23683b Compare March 30, 2026 15:24
2 participants