Skip to content

cp: docs: Adding dtensor TP debugging summary (1767) into r0.5.0#1777

Merged
yuki-97 merged 1 commit into
r0.5.0from
cherry-pick-1767-r0.5.0
Jan 15, 2026
Merged

cp: docs: Adding dtensor TP debugging summary (1767) into r0.5.0#1777
yuki-97 merged 1 commit into
r0.5.0from
cherry-pick-1767-r0.5.0

Conversation

@chtruong814
Copy link
Copy Markdown
Contributor

@chtruong814 chtruong814 commented Jan 15, 2026

beep boop [🤖]: Hi @joyang-nv 👋,

we've cherry picked #1767 into  for you! 🚀

Please review and approve this cherry pick by your convenience!

Summary by CodeRabbit

  • Documentation
    • Added a comprehensive guide on tensor parallelism accuracy for reinforcement learning training, including analysis of training-inference misalignments, performance degradation mechanisms, root cause investigation of cross-device reductions and kernel batching effects, strategy comparisons, practical mitigation recommendations with code examples, and supporting performance metrics.

✏️ Tip: You can customize this high-level summary in your review settings.

@chtruong814 chtruong814 requested a review from a team as a code owner January 15, 2026 01:38
@chtruong814 chtruong814 requested a review from joyang-nv January 15, 2026 01:38
@github-actions github-actions Bot added the Documentation Improvements or additions to documentation label Jan 15, 2026
@yuki-97 yuki-97 added the CI:docs Run doctest label Jan 15, 2026
@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented Jan 15, 2026

📝 Walkthrough

Walkthrough

A new comprehensive documentation guide detailing accuracy challenges encountered with DTensor tensor parallelism in RL training, covering root causes such as batch-variant kernels and cross-device reductions. The guide includes mitigation strategies and concrete code examples. Documentation index updated to include the new guide.

Changes

Cohort / File(s) Summary
Documentation Addition
docs/guides/dtensor-tp-accuracy.md
New comprehensive guide documenting numerical accuracy issues with DTensor tensor parallelism, including analysis of root causes (batch-variant kernels, row-wise sharding), observed phenomena (token\_mult\_prob\_error spikes, reward-model discrepancies), and concrete mitigation recommendations
Documentation Index
docs/index.md
Updated Guides toctree to include the new dtensor-tp-accuracy guide

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Suggested labels

documentation, CI:docs, r0.5.0

Suggested reviewers

  • RayenTian
  • yuki-97
🚥 Pre-merge checks | ✅ 4
✅ Passed checks (4 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title clearly indicates the main change: adding a dtensor TP debugging/accuracy summary document to the r0.5.0 branch via cherry-pick.
Docstring Coverage ✅ Passed No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
Test Results For Major Changes ✅ Passed This is a documentation-only PR with no code changes affecting numerics, convergence, or performance. The custom check targets major code changes that could introduce regressions, but this PR merely adds a comprehensive guide documenting observed accuracy issues with DTensor tensor parallelism.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

🤖 Fix all issues with AI agents
In `@docs/guides/dtensor-tp-accuracy.md`:
- Line 67: The <img> tag referencing
"../assets/dtensor-tp-accuracy/validation_accuracy.png" is missing alt text; add
a descriptive alt attribute to the image element (e.g., alt="Validation accuracy
over training steps for DTensor tensor-parallel experiment") so screen readers
and accessibility tools can convey the image content; update the <img
src="../assets/dtensor-tp-accuracy/validation_accuracy.png" ... /> element to
include this alt attribute.
- Line 95: The image markdown line for the figure
("![](../assets/dtensor-tp-accuracy/logprobs_unequal_1.png)") is missing alt
text; update that markdown to include a concise descriptive alt string (for
example: alt="Plot of log-probabilities showing unequal values across tensor
parallel partitions" or similar) so screen readers can convey the image content
and purpose.
- Around line 47-51: The markdown table block starting with the line beginning
"|               | TP=1   | TP=2..." and ending with the "<p
align="center"><em>Table 1: The validation loss of reward model
training</em></p>" line needs a blank line inserted immediately before the table
and another blank line immediately after the caption paragraph so the table and
its caption are surrounded by empty lines for correct Markdown rendering.
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between e0cced7 and 8efd68d.

⛔ Files ignored due to path filters (6)
  • docs/assets/dtensor-tp-accuracy/image-20260111142255534.png is excluded by !**/*.png
  • docs/assets/dtensor-tp-accuracy/image-20260111160656891-1768118824549-2.png is excluded by !**/*.png
  • docs/assets/dtensor-tp-accuracy/kl_hf_prev.png is excluded by !**/*.png
  • docs/assets/dtensor-tp-accuracy/logprobs_unequal_1.png is excluded by !**/*.png
  • docs/assets/dtensor-tp-accuracy/token_mult_prob_error_qwen3_4B.png is excluded by !**/*.png
  • docs/assets/dtensor-tp-accuracy/validation_accuracy.png is excluded by !**/*.png
📒 Files selected for processing (2)
  • docs/guides/dtensor-tp-accuracy.md
  • docs/index.md
🧰 Additional context used
📓 Path-based instructions (2)
docs/**/*.md

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

Update docs/index.md when a new markdown doc is added under docs/**/*.md or a markdown file is renamed, ensuring the document appears in the most appropriate section

Files:

  • docs/index.md
  • docs/guides/dtensor-tp-accuracy.md
!(**/tests/**|**/test_*.py|**/test_*.sh)

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

Add the NVIDIA copyright header to all Python files and shell scripts (excluding tests). The header should include the current year

Files:

  • docs/index.md
  • docs/guides/dtensor-tp-accuracy.md
🪛 LanguageTool
docs/guides/dtensor-tp-accuracy.md

[style] ~13-~13: Three successive sentences begin with the same word. Consider rewording the sentence or use a thesaurus to find a synonym.
Context: ...under different TP configurations. 3. For overall model training performance: U...

(ENGLISH_WORD_REPEAT_BEGINNING_RULE)


[style] ~179-~179: Consider using a different verb to strengthen your wording.
Context: ...\right]$. ### Root Cause Our analysis shows that the **row-wise (colwise) tensor pa...

(SHOW_INDICATE)

🪛 markdownlint-cli2 (0.18.1)
docs/guides/dtensor-tp-accuracy.md

50-50: Tables should be surrounded by blank lines

(MD058, blanks-around-tables)


67-67: Images should have alternate text (alt text)

(MD045, no-alt-text)


95-95: Images should have alternate text (alt text)

(MD045, no-alt-text)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (10)
  • GitHub Check: sphinx-build / Build docs
  • GitHub Check: Lint check
  • GitHub Check: sphinx-build / Build docs
  • GitHub Check: build-container / main
  • GitHub Check: Lint check
  • GitHub Check: build-container / main
  • GitHub Check: sphinx-build / Build docs
  • GitHub Check: Lint check
  • GitHub Check: Post automodel integration comment / Comment on PR
  • GitHub Check: Post submodule check comment / Comment on PR
🔇 Additional comments (2)
docs/guides/dtensor-tp-accuracy.md (1)

1-242: Excellent comprehensive documentation on DTensor TP accuracy issues.

This guide provides thorough coverage of tensor parallelism accuracy challenges in RL training, including:

  • Well-documented observed issues with concrete examples and metrics
  • Clear mathematical explanations of row-wise vs column-wise parallelism
  • Actionable mitigation strategies with code examples
  • Proper context with references to external research

The technical content is accurate, the structure is logical, and the explanations are detailed enough for practitioners to understand and apply the solutions.

docs/index.md (1)

219-219: Documentation index correctly updated.

The new DTensor TP accuracy guide has been properly added to the Guides section of the documentation index, satisfying the coding guideline requirement to update docs/index.md when adding new markdown documentation.

✏️ Tip: You can disable this entire section by setting review_details to false in your review settings.

Comment on lines +47 to +51
| | TP=1 | TP=2 | TP=4 | TP=8 |
| ------------- | ------ | ------ | ------ | ------ |
| With autocast | 0.6035 | 0.6010 | 0.5864 | 0.6021 |
| W/O autocast | 0.6035 | 0.6010 | 0.5864 | 0.6021 |
<p align="center"><em>Table 1: The validation loss of reward model training</em></p>
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

Add blank lines around the table for proper markdown formatting.

Table 1 should be surrounded by blank lines according to markdown style guidelines. This ensures consistent rendering across different markdown parsers.

📝 Proposed fix
 the discrepancy. Instead, these results imply that the primary source of inconsistency lies in how different TP plans partition and aggregate computations across devices, rather than in precision loss from autocast.
 
 |               | TP=1   | TP=2   | TP=4   | TP=8   |
 | ------------- | ------ | ------ | ------ | ------ |
 | With autocast | 0.6035 | 0.6010 | 0.5864 | 0.6021 |
 | W/O autocast  | 0.6035 | 0.6010 | 0.5864 | 0.6021 |
+
 <p align="center"><em>Table 1: The validation loss of reward model training</em></p>
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
| | TP=1 | TP=2 | TP=4 | TP=8 |
| ------------- | ------ | ------ | ------ | ------ |
| With autocast | 0.6035 | 0.6010 | 0.5864 | 0.6021 |
| W/O autocast | 0.6035 | 0.6010 | 0.5864 | 0.6021 |
<p align="center"><em>Table 1: The validation loss of reward model training</em></p>
| | TP=1 | TP=2 | TP=4 | TP=8 |
| ------------- | ------ | ------ | ------ | ------ |
| With autocast | 0.6035 | 0.6010 | 0.5864 | 0.6021 |
| W/O autocast | 0.6035 | 0.6010 | 0.5864 | 0.6021 |
<p align="center"><em>Table 1: The validation loss of reward model training</em></p>
🧰 Tools
🪛 markdownlint-cli2 (0.18.1)

50-50: Tables should be surrounded by blank lines

(MD058, blanks-around-tables)

🤖 Prompt for AI Agents
In `@docs/guides/dtensor-tp-accuracy.md` around lines 47 - 51, The markdown table
block starting with the line beginning "|               | TP=1   | TP=2..." and
ending with the "<p align="center"><em>Table 1: The validation loss of reward
model training</em></p>" line needs a blank line inserted immediately before the
table and another blank line immediately after the caption paragraph so the
table and its caption are surrounded by empty lines for correct Markdown
rendering.


<p align="center"><em>Fig 2: The reward of Qwen3-4B</em></p>

<img src="../assets/dtensor-tp-accuracy/validation_accuracy.png" style="zoom:33%;" />
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

Add alt text to the image for accessibility.

The image reference is missing alt text, which is important for accessibility and screen readers.

♿ Proposed fix
-<img src="../assets/dtensor-tp-accuracy/validation_accuracy.png" style="zoom:33%;" />
+<img src="../assets/dtensor-tp-accuracy/validation_accuracy.png" alt="Validation accuracy comparison for Qwen3-4B under different TP configurations" style="zoom:33%;" />
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
<img src="../assets/dtensor-tp-accuracy/validation_accuracy.png" style="zoom:33%;" />
<img src="../assets/dtensor-tp-accuracy/validation_accuracy.png" alt="Validation accuracy comparison for Qwen3-4B under different TP configurations" style="zoom:33%;" />
🧰 Tools
🪛 markdownlint-cli2 (0.18.1)

67-67: Images should have alternate text (alt text)

(MD045, no-alt-text)

🤖 Prompt for AI Agents
In `@docs/guides/dtensor-tp-accuracy.md` at line 67, The <img> tag referencing
"../assets/dtensor-tp-accuracy/validation_accuracy.png" is missing alt text; add
a descriptive alt attribute to the image element (e.g., alt="Validation accuracy
over training steps for DTensor tensor-parallel experiment") so screen readers
and accessibility tools can convey the image content; update the <img
src="../assets/dtensor-tp-accuracy/validation_accuracy.png" ... /> element to
include this alt attribute.


This ratio is the standard importance ratio used in off-policy RL to reweight returns when the data are collected under an older behavior policy. In on-policy training, this ratio should be exactly 1. However, in our experiments, we observed cases where the ratio deviates from 1, indicating a mismatch between the intended on-policy setting and the actual behavior of the system. Figure 4 and Figure 5 illustrate this phenomenon by showing the mismatch between `prev_logprobs` and `current_logprobs` under TP=4, as well as the reward curves under TP=4 and TP=1 for the `deepseek-ai/DeepSeek-R1-Distill-Qwen-7B` model.

![](../assets/dtensor-tp-accuracy/logprobs_unequal_1.png)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

Add alt text to the image for accessibility.

The image reference is missing alt text, which is important for accessibility and screen readers.

♿ Proposed fix
-![](../assets/dtensor-tp-accuracy/logprobs_unequal_1.png)
+![Mismatch between prev_logprobs and current_logprobs under TP=4](../assets/dtensor-tp-accuracy/logprobs_unequal_1.png)
🧰 Tools
🪛 markdownlint-cli2 (0.18.1)

95-95: Images should have alternate text (alt text)

(MD045, no-alt-text)

🤖 Prompt for AI Agents
In `@docs/guides/dtensor-tp-accuracy.md` at line 95, The image markdown line for
the figure ("![](../assets/dtensor-tp-accuracy/logprobs_unequal_1.png)") is
missing alt text; update that markdown to include a concise descriptive alt
string (for example: alt="Plot of log-probabilities showing unequal values
across tensor parallel partitions" or similar) so screen readers can convey the
image content and purpose.

@yuki-97 yuki-97 enabled auto-merge (squash) January 15, 2026 01:42
Signed-off-by: Jonas Yang <joyang@nvidia.com>
Signed-off-by: ruit <ruit@nvidia.com>
Signed-off-by: Rayen <ruit@nvidia.com>
Signed-off-by: Yuki Huang <yukih@nvidia.com>
Co-authored-by: ruit <ruit@nvidia.com>
Co-authored-by: Yuki Huang <yukih@nvidia.com>
Co-authored-by: jgerh <163925524+jgerh@users.noreply.github.com>
Signed-off-by: NeMo Bot <nemo-bot@nvidia.com>
@yuki-97 yuki-97 merged commit 174e5cb into r0.5.0 Jan 15, 2026
40 checks passed
@yuki-97 yuki-97 deleted the cherry-pick-1767-r0.5.0 branch January 15, 2026 05:08
avenkateshha pushed a commit to avenkateshha/RL that referenced this pull request Apr 10, 2026
…VIDIA-NeMo#1777)

Signed-off-by: Jonas Yang <joyang@nvidia.com>
Signed-off-by: ruit <ruit@nvidia.com>
Signed-off-by: Rayen <ruit@nvidia.com>
Signed-off-by: Yuki Huang <yukih@nvidia.com>
Signed-off-by: NeMo Bot <nemo-bot@nvidia.com>
Co-authored-by: Jonas Yang CN <joyang@nvidia.com>
Co-authored-by: ruit <ruit@nvidia.com>
Co-authored-by: Yuki Huang <yukih@nvidia.com>
Co-authored-by: jgerh <163925524+jgerh@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

cherry-pick CI:docs Run doctest Documentation Improvements or additions to documentation Run CICD

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants