Bump to pytorch 25.05 container along with TE update#13899

Merged
chtruong814 merged 41 commits into main from chtruong/bump-pytorch-25-05
Jul 6, 2025
Conversation

@chtruong814
Collaborator

@chtruong814 chtruong814 commented Jun 12, 2025

What does this PR do ?

  • Bump to pytorch 25.05 container along with TE update
  • Also, remove the torch accelerator patch, which no longer appears necessary with the current version, and update the triton patch, since the triton fix was not in the latest pytorch container.
  • Set two tests as optional for now:
    • L2_VLM_HF_Transformer_PEFT_4bit - automodel test that is failing because bitsandbytes is not compiled for CUDA 12.9. Will see if I can resolve this later, but it shouldn't be affected in the container.
    • Optional_L2_Speech_Batch_Size_OOMptimizer_Canary - this test is flaky
  • Added additional import guard for Mcore imports to resolve arm/mac install issues
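The import-guard bullet above presumably follows the usual optional-dependency pattern: attempt the import at module load, record whether it succeeded, and fail with a clear message only when the guarded code path is actually used. A hypothetical sketch (the module path and flag names are illustrative, not the actual NeMo code):

```python
# Hypothetical sketch of an import guard for an optional dependency.
# Names below are illustrative; they are not the actual NeMo source.
try:
    from megatron.core import parallel_state  # optional dependency

    HAVE_MEGATRON_CORE = True
except (ImportError, ModuleNotFoundError):
    # On arm/mac installs where Megatron-core is unavailable, importing the
    # package should still succeed; only Mcore-backed features are disabled.
    HAVE_MEGATRON_CORE = False


def require_megatron_core():
    """Raise a clear error if an Mcore-only code path runs without Megatron-core."""
    if not HAVE_MEGATRON_CORE:
        raise ImportError(
            "megatron.core is required for this feature; install megatron-core "
            "or use a container that bundles it."
        )
```

The point of the pattern is that `import nemo` itself never fails on platforms where the optional library cannot be installed.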

Changelog

  • Base container updated to pytorch:25.05-py3; TE updated to 2.4 and trt-llm to 0.20.0.
  • Torch accelerator patch removed; triton patch updated.
  • L2_VLM_HF_Transformer_PEFT_4bit and Optional_L2_Speech_Batch_Size_OOMptimizer_Canary marked as optional tests.
  • Additional import guards added for Mcore imports.

Usage

  • You can potentially add a usage example below
# Add a code snippet demonstrating how to use this 
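As a minimal sketch of consuming this change, a downstream image can build on the bumped base container. The registry path below is an assumption (NVIDIA's NGC registry); the commit message only states "pytorch:25.05-py3", and the `pip install` line is purely illustrative:

```dockerfile
# Assumed NGC registry path; the commit only names the tag "pytorch:25.05-py3".
FROM nvcr.io/nvidia/pytorch:25.05-py3

# Illustrative only: install NeMo on top of the bumped base container.
RUN pip install nemo_toolkit[all]
```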

GitHub Actions CI

The Jenkins CI system has been replaced by GitHub Actions self-hosted runners.

The GitHub Actions CI will run automatically when the "Run CICD" label is added to the PR.
To re-run CI, remove and re-add the label.
To run CI on an untrusted fork, a NeMo user with write access must first click "Approve and run".

Before your PR is "Ready for review"

Pre checks:

  • Make sure you read and followed Contributor guidelines
  • Did you write any new necessary tests?
  • Did you add or update any necessary documentation?
  • Does the PR affect components that are optional to install? (Ex: Numba, Pynini, Apex etc)
    • Reviewer: Does the PR have correct import guards for all optional libraries?

PR Type:

  • New Feature
  • Bugfix
  • Documentation

If you haven't finished some of the above items you can still open "Draft" PR.

Who can review?

Anyone in the community is free to review the PR once the checks have passed.
Contributor guidelines contains specific people who can review PRs to various areas.

Additional Information

  • Related to # (issue)

Signed-off-by: Charlie Truong <chtruong@nvidia.com>
@github-actions github-actions bot removed the Run CICD label Jul 1, 2025
Signed-off-by: Charlie Truong <chtruong@nvidia.com>
This reverts commit c6c3a76.

Signed-off-by: Charlie Truong <chtruong@nvidia.com>
@github-actions
Contributor

github-actions bot commented Jul 6, 2025

[🤖]: Hi @chtruong814 👋,

We wanted to let you know that a CICD pipeline for this PR just finished successfully.

So it might be time to merge this PR or get some approvals.

//cc @chtruong814 @ko3n1g @pablo-garay @thomasdhc

Signed-off-by: Charlie Truong <chtruong@nvidia.com>

@github-actions github-actions bot removed the Run CICD label Jul 6, 2025
@chtruong814 chtruong814 merged commit 0339181 into main Jul 6, 2025
296 checks passed
@chtruong814 chtruong814 deleted the chtruong/bump-pytorch-25-05 branch July 6, 2025 23:58
@chtruong814 chtruong814 added the r2.4.0 label (for auto-cherry-picking into r2.4.0) Jul 6, 2025
ko3n1g added a commit that referenced this pull request Jul 6, 2025
* Update base container to be pytorch:25.05-py3

Signed-off-by: Charlie Truong <chtruong@nvidia.com>

* Update TE to 2.4

Signed-off-by: Charlie Truong <chtruong@nvidia.com>

* Remove torch accelerator patch

Signed-off-by: Charlie Truong <chtruong@nvidia.com>

* Update triton patch

Signed-off-by: Charlie Truong <chtruong@nvidia.com>

* Bump TE and Mcore commits

Signed-off-by: Charlie Truong <chtruong@nvidia.com>

* Fix triton patch

Signed-off-by: Charlie Truong <chtruong@nvidia.com>

* Fix triton patch

Signed-off-by: Charlie Truong <chtruong@nvidia.com>

* No fail fast

Signed-off-by: Charlie Truong <chtruong@nvidia.com>

* Update trt-llm to 0.20.0

Signed-off-by: Charlie Truong <chtruong@nvidia.com>

* Fix test_sched_config_parse_reduce_on_plateau

Signed-off-by: Charlie Truong <chtruong@nvidia.com>

* Add no build isolation to TE

Signed-off-by: Charlie Truong <chtruong@nvidia.com>

* Update trt-llm dependencies

Signed-off-by: Charlie Truong <chtruong@nvidia.com>

* Update manifest

Signed-off-by: Charlie Truong <chtruong@nvidia.com>

* Revert "Enable LoRA for TELinear layers (#13929)"

This reverts commit 7d9f40f.

* update mcore with wd_mult key fix

Signed-off-by: oliver könig <okoenig@nvidia.com>

* Revert "Revert "Enable LoRA for TELinear layers (#13929)""

This reverts commit 5a1da6c.

Signed-off-by: Charlie Truong <chtruong@nvidia.com>

* Fix nemo install

Signed-off-by: Charlie Truong <chtruong@nvidia.com>

* Fix nemo install

Signed-off-by: Charlie Truong <chtruong@nvidia.com>

* Fix export image build

Signed-off-by: Charlie Truong <chtruong@nvidia.com>

* Remove unnecessary sed for torch_tensorrt

Signed-off-by: Charlie Truong <chtruong@nvidia.com>

* Update TE and Mcore commits

Signed-off-by: Charlie Truong <chtruong@nvidia.com>

* Add optional tests

Signed-off-by: Charlie Truong <chtruong@nvidia.com>

* Fix install

Signed-off-by: Charlie Truong <chtruong@nvidia.com>

* Ensure test script arg types are correct for top_p and top_k

Signed-off-by: Charlie Truong <chtruong@nvidia.com>

* Increase export deploy timeouts

Signed-off-by: Charlie Truong <chtruong@nvidia.com>

* Skip failing test_rnnt_logprobs_random after pytorch bump

Signed-off-by: Charlie Truong <chtruong@nvidia.com>

* Skip coverage artifact config-3.12.py

Signed-off-by: Charlie Truong <chtruong@nvidia.com>

* Include more config files ot exclude during coverage

Signed-off-by: Charlie Truong <chtruong@nvidia.com>

* Update dependencies

Signed-off-by: Charlie Truong <chtruong@nvidia.com>

* Ensure top_p is float in nemo_export test script

Signed-off-by: Charlie Truong <chtruong@nvidia.com>

* Set Optional_L2_Speech_Batch_Size_OOMptimizer_Canary to truly be optional

Signed-off-by: Charlie Truong <chtruong@nvidia.com>

* Fix top_k and top_p types in megatronllm_deployable

Signed-off-by: Charlie Truong <chtruong@nvidia.com>

* Revert "Skip failing test_rnnt_logprobs_random after pytorch bump"

This reverts commit c6c3a76.

Signed-off-by: Charlie Truong <chtruong@nvidia.com>

* Fix optional export test

Signed-off-by: Charlie Truong <chtruong@nvidia.com>

* Revert unnecessary changes

Signed-off-by: Charlie Truong <chtruong@nvidia.com>

---------

Signed-off-by: Charlie Truong <chtruong@nvidia.com>
Signed-off-by: oliver könig <okoenig@nvidia.com>
Co-authored-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>
Co-authored-by: oliver könig <okoenig@nvidia.com>
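Several commits above fix the types of `top_p` and `top_k` in test scripts; without an explicit `type=`, argparse passes command-line values through as strings, which breaks downstream numeric comparisons. A hypothetical sketch of the pattern (not the actual NeMo script code):

```python
import argparse

# Hypothetical sketch of the top_p/top_k type fix described in the commits above;
# the real NeMo test scripts may differ. Explicit `type=` makes argparse convert
# the string arguments to numbers instead of leaving them as str.
parser = argparse.ArgumentParser()
parser.add_argument("--top_k", type=int, default=1)      # sampling code expects an int
parser.add_argument("--top_p", type=float, default=0.0)  # sampling code expects a float in [0, 1]

args = parser.parse_args(["--top_k", "40", "--top_p", "0.95"])
```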
chtruong814 added a commit that referenced this pull request Jul 7, 2025
…899)` into `r2.4.0` (#14145)


* Set L2_NeMo_2_Export_Deploy_Query_In_Framework to be optional (#13946)

Signed-off-by: Charlie Truong <chtruong@nvidia.com>

---------

Signed-off-by: Charlie Truong <chtruong@nvidia.com>
Signed-off-by: oliver könig <okoenig@nvidia.com>
Co-authored-by: Charlie Truong <chtruong@nvidia.com>
Co-authored-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>
AmirHussein96 pushed a commit to AmirHussein96/NeMo that referenced this pull request Jul 23, 2025
Signed-off-by: Amir Hussein <amhussein@nvidia.com>
AmirHussein96 pushed a commit to AmirHussein96/NeMo that referenced this pull request Aug 5, 2025
Signed-off-by: Amir Hussein <amhussein@nvidia.com>

Labels

CI, core (Changes to NeMo Core), NLP, no-fail-fast, r2.4.0 (auto-cherry-picking into r2.4.0), skip-linting

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants