Add embedding recipe: build domain-specific embeddings from raw documents #85
Open
oliverholworthy wants to merge 10 commits into main
e90e6b4 to f0c63a4
f0c63a4 to 4fd435d
marcromeyn reviewed on Mar 24, 2026
src/nemotron/recipes/embed/sample_data/nv_pp_random/corporateblog/43679
4fd435d to 67e75af
Signed-off-by: Oliver Holworthy <1216955+oliverholworthy@users.noreply.github.com>
Move detailed documentation from the recipe README into docs/nemotron/embed/ to follow the nano3/super3 pattern. Add grid card and toctree entry in docs/index.md.

Signed-off-by: Oliver Holworthy <1216955+oliverholworthy@users.noreply.github.com>
Remove bundled sample data from the repo and download it on demand from HuggingFace (nvidia/Retrieval-Synthetic-NVDocs-v1). The SDG stage now supports hf:// URIs in the corpus_dir config, e.g.:

    hf://nvidia/Retrieval-Synthetic-NVDocs-v1@<sha>/sample_corpus/nv_pp_random

This keeps the repo lightweight while preserving the zero-config quick start: the default config auto-downloads the sample corpus on first run.

Signed-off-by: Oliver Holworthy <1216955+oliverholworthy@users.noreply.github.com>
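To illustrate the URI shape described in this commit, here is a minimal sketch of how an `hf://<org>/<repo>@<revision>/<subpath>` string could be split into its parts. The `parse_hf_uri` helper and the `HfUri` type are hypothetical names, not taken from the PR, and the `abc123` revision below is a made-up placeholder. The actual SDG stage may parse and download differently.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class HfUri:
    """Parts of an hf:// URI (hypothetical type, for illustration only)."""

    repo_id: str             # "<org>/<repo>"
    revision: Optional[str]  # commit sha, tag, or branch; None if not pinned
    subpath: str             # path inside the repo, "" for the repo root


def parse_hf_uri(uri: str) -> HfUri:
    """Split 'hf://org/repo[@revision][/sub/path]' into its components."""
    prefix = "hf://"
    if not uri.startswith(prefix):
        raise ValueError(f"not an hf:// URI: {uri!r}")

    # At most three pieces: org, repo[@revision], and the remaining subpath.
    parts = uri[len(prefix):].split("/", 2)
    if len(parts) < 2 or not parts[0] or not parts[1]:
        raise ValueError(f"expected hf://<org>/<repo>[...], got {uri!r}")

    org, name_and_rev = parts[0], parts[1]
    subpath = parts[2].strip("/") if len(parts) == 3 else ""

    name, sep, revision = name_and_rev.partition("@")
    if not name or (sep and not revision):
        raise ValueError(f"malformed repo or revision in {uri!r}")

    return HfUri(repo_id=f"{org}/{name}", revision=revision or None, subpath=subpath)


# "abc123" stands in for the <sha> placeholder used in the commit message.
parsed = parse_hf_uri(
    "hf://nvidia/Retrieval-Synthetic-NVDocs-v1@abc123/sample_corpus/nv_pp_random"
)
print(parsed)
```

Pinning the revision in the URI keeps the default config reproducible. The parsed parts could then be handed to a Hub client such as `huggingface_hub.snapshot_download` (via its `revision` and `allow_patterns` arguments) to fetch only that subfolder.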
- conftest.py: Fix cli_module paths from nemotron.cli.embed.* to nemotron.cli.commands.embed.* to match actual module locations
- test_config_models.py: Provide required sdg_input_path for DataPrepConfig tests that construct with defaults
- test_module_exports.py: Update expected exports to match actual API (SCRIPT_PATH, SPEC, META instead of SCRIPT_LOCAL, CONFIG_DIR, etc.)

Signed-off-by: Oliver Holworthy <1216955+oliverholworthy@users.noreply.github.com>
9f03218 to 2e23ca7
Signed-off-by: Oliver Holworthy <1216955+oliverholworthy@users.noreply.github.com>
Signed-off-by: Oliver Holworthy <1216955+oliverholworthy@users.noreply.github.com>
- finetune.py: use entrypoint='python' (was 'python3') to match all other embed commands and the nano3/super3 pattern
- sdg.py: remove unused for_remote variable

Signed-off-by: Oliver Holworthy <1216955+oliverholworthy@users.noreply.github.com>
be7f927 to dd66ea2
- stage1_data_prep: use context manager for open() to avoid file handle leak
- stage4_export: remove unused `from functools import partial` inside export_to_onnx
- stage4_export: restore torch.onnx.export after patching using try/finally
- docs: update corpus_dir default in README to match actual default.yaml HF URI

Signed-off-by: Oliver Holworthy <1216955+oliverholworthy@users.noreply.github.com>
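The two cleanup patterns named in this commit (a context manager around `open()`, and try/finally to undo a monkey-patch) can be sketched as follows. This is an illustration only: `read_jsonl`, `patched_attr`, and the `fake_onnx` stand-in object are hypothetical and are not the code in stage1_data_prep or stage4_export. A plain namespace is used instead of `torch.onnx` so the example runs without PyTorch.

```python
import json
import os
import tempfile
from contextlib import contextmanager
from types import SimpleNamespace


def read_jsonl(path):
    """Read a JSON Lines file; the with-block closes the handle even on error."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


@contextmanager
def patched_attr(obj, name, replacement):
    """Temporarily replace obj.<name>, restoring the original in a finally block."""
    original = getattr(obj, name)
    setattr(obj, name, replacement)
    try:
        yield original
    finally:
        # Runs whether the body returns normally or raises.
        setattr(obj, name, original)


# Stand-in for a module whose function gets patched (hypothetical).
fake_onnx = SimpleNamespace(export=lambda: "real export")

try:
    with patched_attr(fake_onnx, "export", lambda: "patched export"):
        during = fake_onnx.export()
        raise RuntimeError("simulated export failure")
except RuntimeError:
    pass

after = fake_onnx.export()  # the original function is back despite the error

# Round-trip a small file to exercise read_jsonl.
with tempfile.TemporaryDirectory() as tmp:
    sample_path = os.path.join(tmp, "sample.jsonl")
    with open(sample_path, "w", encoding="utf-8") as f:
        f.write('{"id": 1}\n{"id": 2}\n')
    rows = read_jsonl(sample_path)
```

Without the finally clause, an exception raised during export would leave the patched function installed for the rest of the process.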
- Remove uv.lock from .gitignore and track stage lock files for reproducible builds
- Revert AutoTokenizer monkey-patch in eval stage, no longer needed after upstream checkpoint fix (rope_theta added to rope_scaling)

Signed-off-by: Oliver Holworthy <1216955+oliverholworthy@users.noreply.github.com>
…cipe

- Add 'Using NVIDIA's Pre-Generated Dataset' section showing how to skip Stage 0 using nvidia/Retrieval-Synthetic-NVDocs-v1
- Add dataset link to Further Reading

Signed-off-by: root <1216955+oliverholworthy@users.noreply.github.com>
dd66ea2 to f23683b
Summary
- `nemotron embed` (sdg, prep, finetune, eval, export, deploy, run)
- `nemo_runspec` for local-docker execution
- `--help`

Test plan
- `nemotron embed finetune --run local-docker` launches and streams logs
- `nemotron embed finetune --dry-run` shows config without executing
- `nemotron embed finetune --help` displays config options from pydantic model
- `pytest tests/recipes/embed/`
- `nemotron nano3` commands
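The `--help` and `--dry-run` behaviour in the test plan can be pictured with a small stand-in. The sketch below derives command-line options from a config class and resolves the config without executing anything. It deliberately swaps in stdlib `dataclasses` and `argparse` for the pydantic model and the real `nemotron` CLI, and `FinetuneConfig`, its fields, and `resolve` are all hypothetical names rather than the PR's implementation.

```python
import argparse
import dataclasses
from dataclasses import dataclass


@dataclass
class FinetuneConfig:
    """Hypothetical config; field names and defaults are illustrative only."""

    learning_rate: float = 1e-5
    epochs: int = 3
    output_dir: str = "./output"


def build_parser(config_cls):
    """Create one --option per config field so --help lists every setting."""
    parser = argparse.ArgumentParser(prog="finetune-sketch")
    parser.add_argument(
        "--dry-run",
        action="store_true",
        help="show the resolved config without executing",
    )
    for field in dataclasses.fields(config_cls):
        parser.add_argument(
            "--" + field.name.replace("_", "-"),
            type=field.type,  # the annotation (float, int, str) doubles as the converter
            default=field.default,
            help=f"(default: {field.default})",
        )
    return parser


def resolve(argv):
    """Parse argv into a (config, dry_run) pair without running anything."""
    args = vars(build_parser(FinetuneConfig).parse_args(argv))
    dry_run = args.pop("dry_run")
    return FinetuneConfig(**args), dry_run


config, dry_run = resolve(["--epochs", "5", "--dry-run"])
if dry_run:
    # A dry run only reports what would be used.
    print(dataclasses.asdict(config))

help_text = build_parser(FinetuneConfig).format_help()
```

Deriving the options from the config class keeps the help output and the accepted settings in sync, which is presumably the property the `--help` check in the test plan is meant to confirm.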