Skip to content

Updating SDE; s3 read; comparison mode; security fix#15500

Open
Jorjeous wants to merge 26 commits intomainfrom
SDE_NC_Afeat
Open

Updating SDE; s3 read; comparison mode; security fix#15500
Jorjeous wants to merge 26 commits intomainfrom
SDE_NC_Afeat

Conversation

@Jorjeous
Copy link
Copy Markdown
Member

@Jorjeous Jorjeous commented Mar 16, 2026

Important

The Update branch button must only be pressed in very rare occassions.
An outdated branch is never blocking the merge of a PR.
Please reach out to the automation team before pressing that button.

What does this PR do ?

1)PR adds feature to read from s3 storage
2) upgrade comfort in names compare mode - 2 manifests accepted, order invariant
3) sequrity update - removing picke loading

Collection: [Note which collection this PR will affect]

Changelog

  • Add specific line by line info of high level changes in this PR.

Usage

  • You can potentially add a usage example below
# Add a code snippet demonstrating how to use this 

GitHub Actions CI

The Jenkins CI system has been replaced by GitHub Actions self-hosted runners.

The GitHub Actions CI will run automatically when the "Run CICD" label is added to the PR.
To re-run CI remove and add the label again.
To run CI on an untrusted fork, a NeMo user with write access must first click "Approve and run".

Before your PR is "Ready for review"

Pre checks:

  • Make sure you read and followed Contributor guidelines
  • Did you write any new necessary tests?
  • Did you add or update any necessary documentation?
  • Does the PR affect components that are optional to install? (Ex: Numba, Pynini, Apex etc)
    • Reviewer: Does the PR have correct import guards for all optional libraries?

PR Type:

  • New Feature
  • Bugfix
  • Documentation

If you haven't finished some of the above items you can still open "Draft" PR.

Who can review?

Anyone in the community is free to review the PR once the checks have passed.
Contributor guidelines contains specific people who can review PRs to various areas.

Additional Information

  • Related to # (issue)

karpnv and others added 18 commits January 27, 2026 18:13
Signed-off-by: Nikolay Karpov <nkarpov@nvidia.com>
Signed-off-by: karpnv <karpnv@users.noreply.github.com>
Signed-off-by: Nikolay Karpov <nkarpov@nvidia.com>
Signed-off-by: Nikolay Karpov <nkarpov@nvidia.com>
Signed-off-by: karpnv <karpnv@users.noreply.github.com>
…. Updaetd logging system

Signed-off-by: George Zelenfroind <gzelenfroind@nvidia.com>
Signed-off-by: Jorjeous <Jorjeous@users.noreply.github.com>
Signed-off-by: Nikolay Karpov <nkarpov@nvidia.com>
Signed-off-by: George Zelenfroind <gzelenfroind@nvidia.com>
… sharding with separate numeration.

Signed-off-by: George Zelenfroind <gzelenfroind@nvidia.com>
Signed-off-by: George Zelenfroind <gzelenfroind@nvidia.com>
Signed-off-by: Jorjeous <Jorjeous@users.noreply.github.com>
Comment thread tools/speech_data_explorer/data_explorer.py Fixed
Comment thread tools/speech_data_explorer/data_explorer.py Fixed
Comment thread tools/speech_data_explorer/data_explorer.py Fixed
@Jorjeous Jorjeous requested review from andrusenkoau and karpnv March 16, 2026 13:12
@Jorjeous
Copy link
Copy Markdown
Member Author

@andrusenkoau plz check when you available

item1 = json.loads(line1)
item2 = json.loads(line2)

if not audio_mismatch_warned and item1.get('audio_filepath') != item2.get('audio_filepath'):
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What do you think about merging manifests that can have different orders of output files? I had the following problem during my experiment -- I had to compare results from two models that used different decoding scripts (cache-aware vs buffered rnnt decoding). The second script applied input manifest sorting for faster inference by default. The first did not. In this case, I had different file orders in my output manifests. It looks like the current implementation will not process this situation correctly. We need to compare the manifest based on the audio file paths.

This is an example of my notebook code for it:

data_1 = read_manifest(manifest_1)
data_2 = read_manifest(manifest_2)

data_2_map_dict = {}
for idx, item in enumerate(data_2):
    data_2_map_dict[item["audio_filepath"]] = idx

with open(output_manifest, "w", encoding="utf-8") as fn:
    for idx, item in enumerate(data_1):
        
        assert item["audio_filepath"] in data_2_map_dict
        
        item["pred_text_model_1"] = item["pred_text"]
        item["pred_text_model_2"] = data_2[data_2_map_dict[item["audio_filepath"]]]["pred_text"]
        item = json.dumps(item, ensure_ascii=False)
        fn.write(f"{item}\n")

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thx!
That was totally out of scope, will do

Jorjeous and others added 2 commits April 20, 2026 06:38
…my filename in case unordered manifests;

Signed-off-by: George Zelenfroind <gzelenfroind@nvidia.com>
Signed-off-by: Jorjeous <Jorjeous@users.noreply.github.com>
@Jorjeous Jorjeous mentioned this pull request Apr 20, 2026
2 tasks
@Jorjeous Jorjeous changed the title Updating SDE comparison mode to accept two manifests Updating SDE; s3 read; comparison mode; security fix Apr 20, 2026
Jorjeous and others added 2 commits April 20, 2026 15:35
@Jorjeous Jorjeous requested a review from andrusenkoau April 20, 2026 22:36
…tart and available options, lazy import for s3 dependancies

Signed-off-by: George Zelenfroind <gzelenfroind@nvidia.com>
@copy-pr-bot
Copy link
Copy Markdown

copy-pr-bot Bot commented Apr 29, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

Jorjeous and others added 2 commits April 29, 2026 12:41
Signed-off-by: Jorjeous <Jorjeous@users.noreply.github.com>
…expansion

Signed-off-by: George Zelenfroind <gzelenfroind@nvidia.com>
@blisc
Copy link
Copy Markdown
Collaborator

blisc commented Apr 29, 2026

/ok to test 7a3c00e

@Jorjeous Jorjeous enabled auto-merge (squash) May 1, 2026 08:33
@Jorjeous Jorjeous disabled auto-merge May 1, 2026 08:34
@blisc
Copy link
Copy Markdown
Collaborator

blisc commented May 5, 2026

@Jorjeous where are we with this PR?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants