Updating SDE; s3 read; comparison mode; security fix #15500
Signed-off-by: Nikolay Karpov <nkarpov@nvidia.com>
Signed-off-by: karpnv <karpnv@users.noreply.github.com>
Signed-off-by: Nikolay Karpov <nkarpov@nvidia.com>
Signed-off-by: Nikolay Karpov <nkarpov@nvidia.com>
Signed-off-by: karpnv <karpnv@users.noreply.github.com>
…. Updated logging system Signed-off-by: George Zelenfroind <gzelenfroind@nvidia.com>
Signed-off-by: Jorjeous <Jorjeous@users.noreply.github.com>
Signed-off-by: Nikolay Karpov <nkarpov@nvidia.com>
Signed-off-by: George Zelenfroind <gzelenfroind@nvidia.com>
… sharding with separate numeration. Signed-off-by: George Zelenfroind <gzelenfroind@nvidia.com>
Signed-off-by: George Zelenfroind <gzelenfroind@nvidia.com>
Signed-off-by: Jorjeous <Jorjeous@users.noreply.github.com>
@andrusenkoau please check when you are available.
item1 = json.loads(line1)
item2 = json.loads(line2)

if not audio_mismatch_warned and item1.get('audio_filepath') != item2.get('audio_filepath'):
What do you think about merging manifests whose output files are in different orders? I ran into this during an experiment: I had to compare results from two models that used different decoding scripts (cache-aware vs. buffered RNNT decoding). By default, the second script sorts the input manifest for faster inference; the first does not. As a result, my output manifests had different file orders. It looks like the current implementation will not handle this case correctly. We need to compare the manifests based on the audio file paths.
Here is example code from my notebook for it:
import json

# read_manifest is assumed to return a list of dicts, one per manifest line
data_1 = read_manifest(manifest_1)
data_2 = read_manifest(manifest_2)

# Map each audio path in the second manifest to its index
data_2_map_dict = {}
for idx, item in enumerate(data_2):
    data_2_map_dict[item["audio_filepath"]] = idx

with open(output_manifest, "w", encoding="utf-8") as fn:
    for idx, item in enumerate(data_1):
        assert item["audio_filepath"] in data_2_map_dict
        item["pred_text_model_1"] = item["pred_text"]
        item["pred_text_model_2"] = data_2[data_2_map_dict[item["audio_filepath"]]]["pred_text"]
        item = json.dumps(item, ensure_ascii=False)
        fn.write(f"{item}\n")
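A variant of the same idea, as a minimal sketch rather than the code merged in this PR: it joins the two manifests on `audio_filepath`, copies entries instead of mutating them, and collects unmatched paths instead of stopping at the first one. The function name `merge_predictions` and the sample entries are hypothetical.

```python
import json


def merge_predictions(data_1, data_2, key="audio_filepath"):
    """Join two lists of manifest entries on `key`, ignoring entry order."""
    # Index the second manifest by key; the first occurrence wins on duplicates.
    index_2 = {}
    for item in data_2:
        index_2.setdefault(item[key], item)

    merged, missing = [], []
    for item in data_1:
        match = index_2.get(item[key])
        if match is None:
            # Collect unmatched entries instead of failing on the first one.
            missing.append(item[key])
            continue
        row = dict(item)  # copy so the input entries are not mutated
        row["pred_text_model_1"] = item["pred_text"]
        row["pred_text_model_2"] = match["pred_text"]
        merged.append(row)
    return merged, missing


# Hypothetical sample data: the second manifest is in a different order
# and lacks one file.
data_1 = [
    {"audio_filepath": "a.wav", "pred_text": "hello"},
    {"audio_filepath": "b.wav", "pred_text": "world"},
    {"audio_filepath": "c.wav", "pred_text": "only in one"},
]
data_2 = [
    {"audio_filepath": "b.wav", "pred_text": "word"},
    {"audio_filepath": "a.wav", "pred_text": "hallo"},
]

merged, missing = merge_predictions(data_1, data_2)
for row in merged:
    print(json.dumps(row, ensure_ascii=False))
```

Returning the unmatched paths lets the caller decide whether a partial overlap is an error or only worth a warning.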
Thanks!
That was outside the original scope, but I will add it.
…my filename in case unordered manifests; Signed-off-by: George Zelenfroind <gzelenfroind@nvidia.com>
Signed-off-by: Jorjeous <Jorjeous@users.noreply.github.com>
Signed-off-by: George Zelenfroind <gzelenfroind@nvidia.com>
…tart and available options, lazy import for s3 dependencies Signed-off-by: George Zelenfroind <gzelenfroind@nvidia.com>
Signed-off-by: Jorjeous <Jorjeous@users.noreply.github.com>
…expansion Signed-off-by: George Zelenfroind <gzelenfroind@nvidia.com>
/ok to test 7a3c00e
@Jorjeous where are we with this PR?
Important
The Update branch button must only be pressed on very rare occasions. An outdated branch is never blocking the merge of a PR.
Please reach out to the automation team before pressing that button.
What does this PR do?
1) Adds the ability to read from S3 storage.
2) Makes comparison mode more convenient: two manifests are accepted, and their entry order does not matter.
3) Security update: removes pickle loading.
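To illustrate the first item together with the lazy import mentioned in the commit messages, here is a minimal sketch, not the PR's actual implementation. The helper names `parse_s3_uri` and `read_text` are hypothetical, and the sketch assumes `boto3` is the S3 client and that credentials are configured in the environment. `boto3` is imported only when an `s3://` path is actually requested, so local-only use does not need it installed.

```python
from urllib.parse import urlparse


def parse_s3_uri(uri):
    """Split 's3://bucket/key' into (bucket, key)."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"Not an S3 URI: {uri!r}")
    return parsed.netloc, parsed.path.lstrip("/")


def read_text(path, encoding="utf-8"):
    """Read a local file or an S3 object as text."""
    if path.startswith("s3://"):
        # Lazy import: boto3 is required only when an S3 path is used.
        import boto3

        bucket, key = parse_s3_uri(path)
        body = boto3.client("s3").get_object(Bucket=bucket, Key=key)["Body"]
        return body.read().decode(encoding)
    with open(path, "r", encoding=encoding) as f:
        return f.read()
```

A call such as `read_text("s3://my-bucket/dir/manifest.json")` would fetch the object, while `read_text("manifest.json")` reads from disk; the bucket and key here are placeholders.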
Collection: [Note which collection this PR will affect]
Changelog
Usage
# Add a code snippet demonstrating how to use this
GitHub Actions CI
The Jenkins CI system has been replaced by GitHub Actions self-hosted runners.
The GitHub Actions CI will run automatically when the "Run CICD" label is added to the PR.
To re-run CI remove and add the label again.
To run CI on an untrusted fork, a NeMo user with write access must first click "Approve and run".
Before your PR is "Ready for review"
Pre checks:
PR Type:
If you haven't finished some of the above items, you can still open a "Draft" PR.
Who can review?
Anyone in the community is free to review the PR once the checks have passed.
The Contributor guidelines list specific people who can review PRs in various areas.
Additional Information