Updating SDE; s3 read; comparison mode; security fix by Jorjeous · Pull Request #15500 · NVIDIA-NeMo/NeMo

Jorjeous · 2026-03-16T11:10:48Z

Important

The Update branch button must only be pressed on very rare occasions.
An outdated branch is never blocking the merge of a PR.
Please reach out to the automation team before pressing that button.

What does this PR do ?

1) Adds the ability to read from S3 storage
2) Makes names-compare mode more convenient: two manifests are accepted, and matching is order-invariant
3) Security update: removes pickle loading
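
To illustrate item 3: `pickle.load` can execute arbitrary code while deserializing, so loading a pickle file from an untrusted source is unsafe. The PR description does not say what replaces the pickle loading; the sketch below simply assumes cached records are plain dictionaries and stores them as JSON. The names `save_cache` and `load_cache` are hypothetical and are not taken from the PR.

```python
import json
from pathlib import Path


def save_cache(path, records):
    """Write a list of JSON-serializable records to `path` (hypothetical helper)."""
    Path(path).write_text(json.dumps(records, ensure_ascii=False), encoding="utf-8")


def load_cache(path):
    """Read records back; json.loads only builds plain data and never runs code."""
    return json.loads(Path(path).read_text(encoding="utf-8"))
```

The trade-off is that JSON can only hold plain data (dicts, lists, strings, numbers), so anything richer would need an explicit conversion step.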

Collection: [Note which collection this PR will affect]

Changelog

Add specific line by line info of high level changes in this PR.

Usage

You can potentially add a usage example below

# Add a code snippet demonstrating how to use this

GitHub Actions CI

The Jenkins CI system has been replaced by GitHub Actions self-hosted runners.

The GitHub Actions CI will run automatically when the "Run CICD" label is added to the PR.
To re-run CI remove and add the label again.
To run CI on an untrusted fork, a NeMo user with write access must first click "Approve and run".

Before your PR is "Ready for review"

Pre checks:

Make sure you read and followed Contributor guidelines
Did you write any new necessary tests?
Did you add or update any necessary documentation?
Does the PR affect components that are optional to install? (Ex: Numba, Pynini, Apex etc)
- Reviewer: Does the PR have correct import guards for all optional libraries?

PR Type:

New Feature
Bugfix
Documentation

If you haven't finished some of the above items you can still open "Draft" PR.

Who can review?

Anyone in the community is free to review the PR once the checks have passed.
Contributor guidelines contains specific people who can review PRs to various areas.

Additional Information

Related to # (issue)

Jorjeous · 2026-03-30T14:23:44Z

@andrusenkoau please check when you are available

andrusenkoau · 2026-04-08T06:13:42Z

+            item1 = json.loads(line1)
+            item2 = json.loads(line2)
+
+            if not audio_mismatch_warned and item1.get('audio_filepath') != item2.get('audio_filepath'):


What do you think about merging manifests whose output files are in different orders? I ran into this during an experiment: I had to compare results from two models that used different decoding scripts (cache-aware vs. buffered RNN-T decoding). By default, the second script sorted the input manifest for faster inference; the first did not. As a result, the files appeared in different orders in my output manifests. It looks like the current implementation will not handle this situation correctly. We need to match the manifest entries by audio file path.

This is an example of my notebook code for it:

data_1 = read_manifest(manifest_1)
data_2 = read_manifest(manifest_2)

data_2_map_dict = {}
for idx, item in enumerate(data_2):
    data_2_map_dict[item["audio_filepath"]] = idx

with open(output_manifest, "w", encoding="utf-8") as fn:
    for idx, item in enumerate(data_1):
        assert item["audio_filepath"] in data_2_map_dict
        item["pred_text_model_1"] = item["pred_text"]
        item["pred_text_model_2"] = data_2[data_2_map_dict[item["audio_filepath"]]]["pred_text"]
        item = json.dumps(item, ensure_ascii=False)
        fn.write(f"{item}\n")

Thx!
That was totally out of scope, will do
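
The path-based matching discussed in this thread can be sketched as a small, self-contained function. This is an illustration under assumptions, not the code that was merged into the PR: each manifest is assumed to be already loaded as a list of dictionaries with `audio_filepath` and `pred_text` keys, and the name `merge_by_audio_filepath` is hypothetical. Unlike the notebook snippet, it collects unmatched paths instead of asserting.

```python
def merge_by_audio_filepath(data_1, data_2):
    """Pair entries of two manifests by audio path, ignoring their order.

    Returns (merged, unmatched): merged rows carry both predictions, and
    unmatched lists the paths from data_1 that are missing in data_2.
    """
    # Index the second manifest once so each lookup is O(1).
    index_2 = {item["audio_filepath"]: item for item in data_2}

    merged, unmatched = [], []
    for item in data_1:
        other = index_2.get(item["audio_filepath"])
        if other is None:
            unmatched.append(item["audio_filepath"])
            continue
        row = dict(item)  # copy so the input manifest is not mutated
        row["pred_text_model_1"] = item["pred_text"]
        row["pred_text_model_2"] = other["pred_text"]
        merged.append(row)
    return merged, unmatched
```

Reporting unmatched paths rather than failing on the first one makes it easier to see how far two manifests diverge before deciding whether to proceed.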

copy-pr-bot · 2026-04-29T12:40:53Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

blisc · 2026-04-29T13:51:22Z

/ok to test 7a3c00e

blisc · 2026-05-05T16:42:19Z

@Jorjeous where are we with this PR?

karpnv and others added 18 commits January 27, 2026 18:13

read manifest from s3

fe21e7e

Signed-off-by: Nikolay Karpov <nkarpov@nvidia.com>

Apply isort and black reformatting

9069210

Signed-off-by: karpnv <karpnv@users.noreply.github.com>

s3cfg parameter

89a595f

Signed-off-by: Nikolay Karpov <nkarpov@nvidia.com>

Merge branch 'main' of github.com:NVIDIA/NeMo into karpnv/sde_s3

d67ec95

file range

da895cb

Signed-off-by: Nikolay Karpov <nkarpov@nvidia.com>

Apply isort and black reformatting

b79a0da

Signed-off-by: karpnv <karpnv@users.noreply.github.com>

Avoid downloading of full tar, instead extracting specific audio file…

fce458b

…. Updaetd logging system Signed-off-by: George Zelenfroind <gzelenfroind@nvidia.com>

Apply isort and black reformatting

3e5f4ec

Signed-off-by: Jorjeous <Jorjeous@users.noreply.github.com>

Merge branch 'main' of github.com:NVIDIA/NeMo into karpnv/sde_s3

391b045

shard_index + 1

64e662f

Signed-off-by: Nikolay Karpov <nkarpov@nvidia.com>

Merge branch 'main' of github.com:NVIDIA/NeMo into karpnv/sde_s3

0e043e3

Merge branch 'main' into karpnv/sde_s3

850fd4c

Undo latest changes, as it was dataset specific

69500f6

Signed-off-by: George Zelenfroind <gzelenfroind@nvidia.com>

Merge branch 'main' of github.com:NVIDIA/NeMo into karpnv/sde_s3

ba40e0a

update table to not fail on "non-string format", update bucketing and…

2bae4dc

… sharding with separate numeration. Signed-off-by: George Zelenfroind <gzelenfroind@nvidia.com>

Merge branch 'main' into karpnv/sde_s3

3803d5d

add ability to read two manifests in comparison mode

9ae783f

Signed-off-by: George Zelenfroind <gzelenfroind@nvidia.com>

Apply isort and black reformatting

39fb4e8

Signed-off-by: Jorjeous <Jorjeous@users.noreply.github.com>

github-advanced-security AI found potential problems Mar 16, 2026

Comment thread tools/speech_data_explorer/data_explorer.py Fixed

Comment thread tools/speech_data_explorer/data_explorer.py Fixed

Comment thread tools/speech_data_explorer/data_explorer.py Fixed

Jorjeous requested review from andrusenkoau and karpnv March 16, 2026 13:12

andrusenkoau reviewed Apr 8, 2026

Jorjeous and others added 2 commits April 20, 2026 06:38

removing picke; --s3cfg=AIS -- read env vars; add matching manifests …

26cbf52

…my filename in case unordered manifests; Signed-off-by: George Zelenfroind <gzelenfroind@nvidia.com>

Apply isort and black reformatting

083457a

Signed-off-by: Jorjeous <Jorjeous@users.noreply.github.com>

Jorjeous mentioned this pull request Apr 20, 2026

read range of manifests from s3 #15330

Closed

2 tasks

Jorjeous changed the title ~~Updating SDE comparison mode to accept two manifests~~ Updating SDE; s3 read; comparison mode; security fix Apr 20, 2026

Jorjeous and others added 2 commits April 20, 2026 15:35

adding --force flag to load if required fields missing

ca8e494

Signed-off-by: George Zelenfroind <gzelenfroind@nvidia.com>

Merge branch 'main' into SDE_NC_Afeat

504f3a9

Jorjeous requested a review from andrusenkoau April 20, 2026 22:36

update documentation, guard issues, update requrements, Added quick s…

1fa7012

…tart and available options, lazy import for s3 dependancies Signed-off-by: George Zelenfroind <gzelenfroind@nvidia.com>
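
A lazy import for an optional dependency, as this commit message mentions, usually means deferring the import until the feature is actually used. The sketch below is a generic illustration, not the PR's code: it assumes the S3 client library is `boto3`, and both `lazy_import` and `read_s3_bytes` are hypothetical names.

```python
import importlib


def lazy_import(module_name, hint=""):
    """Import `module_name` on first use and raise a readable error if it is absent."""
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        message = f"Optional dependency '{module_name}' is required for this feature. {hint}"
        raise ImportError(message.strip()) from exc


def read_s3_bytes(bucket, key):
    """Fetch one object from S3; boto3 is only imported when this is called."""
    boto3 = lazy_import("boto3", "Install it to read manifests from S3.")
    return boto3.client("s3").get_object(Bucket=bucket, Key=key)["Body"].read()
```

With this pattern, users who only work with local manifests never need the S3 packages installed.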

Jorjeous and others added 2 commits April 29, 2026 12:41

Apply isort and black reformatting

c3a298d

Signed-off-by: Jorjeous <Jorjeous@users.noreply.github.com>

README: document _OP_/_CL_ sharded path syntax and cartesian-product …

25d3b04

…expansion Signed-off-by: George Zelenfroind <gzelenfroind@nvidia.com>
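
The cartesian-product expansion named in this commit can be sketched as follows. This is a standalone illustration and not the README's or the tool's implementation: it assumes that `_OP_` and `_CL_` stand in for `{` and `}`, that only numeric ranges of the form `_OP_a..b_CL_` need handling, and that a lower bound written with leading zeros implies zero padding. The function name `expand_sharded_path` is hypothetical.

```python
import itertools
import re

# Numeric range such as _OP_0..3_CL_ (assumed form; other patterns are ignored).
_RANGE = re.compile(r"_OP_(\d+)\.\.(\d+)_CL_")


def expand_sharded_path(path):
    """Expand every _OP_a..b_CL_ range; several ranges yield a cartesian product."""
    literals, choices, pos = [], [], 0
    for match in _RANGE.finditer(path):
        literals.append(path[pos:match.start()])  # text before this range
        lo, hi = match.group(1), match.group(2)
        # Keep zero padding when the lower bound has leading zeros, e.g. 00..03.
        width = len(lo) if len(lo) > 1 and lo.startswith("0") else 0
        choices.append([str(i).zfill(width) for i in range(int(lo), int(hi) + 1)])
        pos = match.end()
    tail = path[pos:]

    expanded = []
    for combo in itertools.product(*choices):
        body = "".join(lit + value for lit, value in zip(literals, combo))
        expanded.append(body + tail)
    return expanded
```

A path with no range is returned unchanged as a one-element list, and a path with two ranges of sizes m and n expands to m × n paths, with the rightmost range varying fastest.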

blisc added the skip-linting label Apr 29, 2026

Merge branch 'main' into SDE_NC_Afeat

7a3c00e

copy-pr-bot Bot temporarily deployed to test April 29, 2026 13:53 Inactive

blisc approved these changes Apr 29, 2026

Jorjeous enabled auto-merge (squash) May 1, 2026 08:33

Jorjeous disabled auto-merge May 1, 2026 08:34

Jorjeous wants to merge 26 commits into main from SDE_NC_Afeat


5 participants
