cp: docs: Adding dtensor TP debugging summary (1767) into r0.5.0#1777
Conversation
📝 WalkthroughWalkthroughA new comprehensive documentation guide detailing accuracy challenges encountered with DTensor tensor parallelism in RL training, covering root causes such as batch-variant kernels and cross-device reductions. The guide includes mitigation strategies and concrete code examples. Documentation index updated to include the new guide. Changes
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes Suggested labels
Suggested reviewers
🚥 Pre-merge checks | ✅ 4✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 3
🤖 Fix all issues with AI agents
In `@docs/guides/dtensor-tp-accuracy.md`:
- Line 67: The <img> tag referencing
"../assets/dtensor-tp-accuracy/validation_accuracy.png" is missing alt text; add
a descriptive alt attribute to the image element (e.g., alt="Validation accuracy
over training steps for DTensor tensor-parallel experiment") so screen readers
and accessibility tools can convey the image content; update the <img
src="../assets/dtensor-tp-accuracy/validation_accuracy.png" ... /> element to
include this alt attribute.
- Line 95: The image markdown line for the figure
("") is missing alt
text; update that markdown to include a concise descriptive alt string (for
example: alt="Plot of log-probabilities showing unequal values across tensor
parallel partitions" or similar) so screen readers can convey the image content
and purpose.
- Around line 47-51: The markdown table block starting with the line beginning
"| | TP=1 | TP=2..." and ending with the "<p
align="center"><em>Table 1: The validation loss of reward model
training</em></p>" line needs a blank line inserted immediately before the table
and another blank line immediately after the caption paragraph so the table and
its caption are surrounded by empty lines for correct Markdown rendering.
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
⛔ Files ignored due to path filters (6)
docs/assets/dtensor-tp-accuracy/image-20260111142255534.pngis excluded by!**/*.pngdocs/assets/dtensor-tp-accuracy/image-20260111160656891-1768118824549-2.pngis excluded by!**/*.pngdocs/assets/dtensor-tp-accuracy/kl_hf_prev.pngis excluded by!**/*.pngdocs/assets/dtensor-tp-accuracy/logprobs_unequal_1.pngis excluded by!**/*.pngdocs/assets/dtensor-tp-accuracy/token_mult_prob_error_qwen3_4B.pngis excluded by!**/*.pngdocs/assets/dtensor-tp-accuracy/validation_accuracy.pngis excluded by!**/*.png
📒 Files selected for processing (2)
docs/guides/dtensor-tp-accuracy.mddocs/index.md
🧰 Additional context used
📓 Path-based instructions (2)
docs/**/*.md
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
Update docs/index.md when a new markdown doc is added under docs/**/*.md or a markdown file is renamed, ensuring the document appears in the most appropriate section
Files:
docs/index.mddocs/guides/dtensor-tp-accuracy.md
!(**/tests/**|**/test_*.py|**/test_*.sh)
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
Add the NVIDIA copyright header to all Python files and shell scripts (excluding tests). The header should include the current year
Files:
docs/index.mddocs/guides/dtensor-tp-accuracy.md
🪛 LanguageTool
docs/guides/dtensor-tp-accuracy.md
[style] ~13-~13: Three successive sentences begin with the same word. Consider rewording the sentence or use a thesaurus to find a synonym.
Context: ...under different TP configurations. 3. For overall model training performance: U...
(ENGLISH_WORD_REPEAT_BEGINNING_RULE)
[style] ~179-~179: Consider using a different verb to strengthen your wording.
Context: ...\right]$. ### Root Cause Our analysis shows that the **row-wise (colwise) tensor pa...
(SHOW_INDICATE)
🪛 markdownlint-cli2 (0.18.1)
docs/guides/dtensor-tp-accuracy.md
50-50: Tables should be surrounded by blank lines
(MD058, blanks-around-tables)
67-67: Images should have alternate text (alt text)
(MD045, no-alt-text)
95-95: Images should have alternate text (alt text)
(MD045, no-alt-text)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (10)
- GitHub Check: sphinx-build / Build docs
- GitHub Check: Lint check
- GitHub Check: sphinx-build / Build docs
- GitHub Check: build-container / main
- GitHub Check: Lint check
- GitHub Check: build-container / main
- GitHub Check: sphinx-build / Build docs
- GitHub Check: Lint check
- GitHub Check: Post automodel integration comment / Comment on PR
- GitHub Check: Post submodule check comment / Comment on PR
🔇 Additional comments (2)
docs/guides/dtensor-tp-accuracy.md (1)
1-242: Excellent comprehensive documentation on DTensor TP accuracy issues.This guide provides thorough coverage of tensor parallelism accuracy challenges in RL training, including:
- Well-documented observed issues with concrete examples and metrics
- Clear mathematical explanations of row-wise vs column-wise parallelism
- Actionable mitigation strategies with code examples
- Proper context with references to external research
The technical content is accurate, the structure is logical, and the explanations are detailed enough for practitioners to understand and apply the solutions.
docs/index.md (1)
219-219: Documentation index correctly updated.The new DTensor TP accuracy guide has been properly added to the Guides section of the documentation index, satisfying the coding guideline requirement to update docs/index.md when adding new markdown documentation.
✏️ Tip: You can disable this entire section by setting review_details to false in your review settings.
| | | TP=1 | TP=2 | TP=4 | TP=8 | | ||
| | ------------- | ------ | ------ | ------ | ------ | | ||
| | With autocast | 0.6035 | 0.6010 | 0.5864 | 0.6021 | | ||
| | W/O autocast | 0.6035 | 0.6010 | 0.5864 | 0.6021 | | ||
| <p align="center"><em>Table 1: The validation loss of reward model training</em></p> |
There was a problem hiding this comment.
Add blank lines around the table for proper markdown formatting.
Table 1 should be surrounded by blank lines according to markdown style guidelines. This ensures consistent rendering across different markdown parsers.
📝 Proposed fix
the discrepancy. Instead, these results imply that the primary source of inconsistency lies in how different TP plans partition and aggregate computations across devices, rather than in precision loss from autocast.
| | TP=1 | TP=2 | TP=4 | TP=8 |
| ------------- | ------ | ------ | ------ | ------ |
| With autocast | 0.6035 | 0.6010 | 0.5864 | 0.6021 |
| W/O autocast | 0.6035 | 0.6010 | 0.5864 | 0.6021 |
+
<p align="center"><em>Table 1: The validation loss of reward model training</em></p>📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| | | TP=1 | TP=2 | TP=4 | TP=8 | | |
| | ------------- | ------ | ------ | ------ | ------ | | |
| | With autocast | 0.6035 | 0.6010 | 0.5864 | 0.6021 | | |
| | W/O autocast | 0.6035 | 0.6010 | 0.5864 | 0.6021 | | |
| <p align="center"><em>Table 1: The validation loss of reward model training</em></p> | |
| | | TP=1 | TP=2 | TP=4 | TP=8 | | |
| | ------------- | ------ | ------ | ------ | ------ | | |
| | With autocast | 0.6035 | 0.6010 | 0.5864 | 0.6021 | | |
| | W/O autocast | 0.6035 | 0.6010 | 0.5864 | 0.6021 | | |
| <p align="center"><em>Table 1: The validation loss of reward model training</em></p> |
🧰 Tools
🪛 markdownlint-cli2 (0.18.1)
50-50: Tables should be surrounded by blank lines
(MD058, blanks-around-tables)
🤖 Prompt for AI Agents
In `@docs/guides/dtensor-tp-accuracy.md` around lines 47 - 51, The markdown table
block starting with the line beginning "| | TP=1 | TP=2..." and
ending with the "<p align="center"><em>Table 1: The validation loss of reward
model training</em></p>" line needs a blank line inserted immediately before the
table and another blank line immediately after the caption paragraph so the
table and its caption are surrounded by empty lines for correct Markdown
rendering.
|
|
||
| <p align="center"><em>Fig 2: The reward of Qwen3-4B</em></p> | ||
|
|
||
| <img src="../assets/dtensor-tp-accuracy/validation_accuracy.png" style="zoom:33%;" /> |
There was a problem hiding this comment.
Add alt text to the image for accessibility.
The image reference is missing alt text, which is important for accessibility and screen readers.
♿ Proposed fix
-<img src="../assets/dtensor-tp-accuracy/validation_accuracy.png" style="zoom:33%;" />
+<img src="../assets/dtensor-tp-accuracy/validation_accuracy.png" alt="Validation accuracy comparison for Qwen3-4B under different TP configurations" style="zoom:33%;" />📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| <img src="../assets/dtensor-tp-accuracy/validation_accuracy.png" style="zoom:33%;" /> | |
| <img src="../assets/dtensor-tp-accuracy/validation_accuracy.png" alt="Validation accuracy comparison for Qwen3-4B under different TP configurations" style="zoom:33%;" /> |
🧰 Tools
🪛 markdownlint-cli2 (0.18.1)
67-67: Images should have alternate text (alt text)
(MD045, no-alt-text)
🤖 Prompt for AI Agents
In `@docs/guides/dtensor-tp-accuracy.md` at line 67, The <img> tag referencing
"../assets/dtensor-tp-accuracy/validation_accuracy.png" is missing alt text; add
a descriptive alt attribute to the image element (e.g., alt="Validation accuracy
over training steps for DTensor tensor-parallel experiment") so screen readers
and accessibility tools can convey the image content; update the <img
src="../assets/dtensor-tp-accuracy/validation_accuracy.png" ... /> element to
include this alt attribute.
|
|
||
| This ratio is the standard importance ratio used in off-policy RL to reweight returns when the data are collected under an older behavior policy. In on-policy training, this ratio should be exactly 1. However, in our experiments, we observed cases where the ratio deviates from 1, indicating a mismatch between the intended on-policy setting and the actual behavior of the system. Figure 4 and Figure 5 illustrate this phenomenon by showing the mismatch between `prev_logprobs` and `current_logprobs` under TP=4, as well as the reward curves under TP=4 and TP=1 for the `deepseek-ai/DeepSeek-R1-Distill-Qwen-7B` model. | ||
|
|
||
|  |
There was a problem hiding this comment.
Add alt text to the image for accessibility.
The image reference is missing alt text, which is important for accessibility and screen readers.
♿ Proposed fix
-
+🧰 Tools
🪛 markdownlint-cli2 (0.18.1)
95-95: Images should have alternate text (alt text)
(MD045, no-alt-text)
🤖 Prompt for AI Agents
In `@docs/guides/dtensor-tp-accuracy.md` at line 95, The image markdown line for
the figure ("") is
missing alt text; update that markdown to include a concise descriptive alt
string (for example: alt="Plot of log-probabilities showing unequal values
across tensor parallel partitions" or similar) so screen readers can convey the
image content and purpose.
Signed-off-by: Jonas Yang <joyang@nvidia.com> Signed-off-by: ruit <ruit@nvidia.com> Signed-off-by: Rayen <ruit@nvidia.com> Signed-off-by: Yuki Huang <yukih@nvidia.com> Co-authored-by: ruit <ruit@nvidia.com> Co-authored-by: Yuki Huang <yukih@nvidia.com> Co-authored-by: jgerh <163925524+jgerh@users.noreply.github.com> Signed-off-by: NeMo Bot <nemo-bot@nvidia.com>
8efd68d to
d6112f3
Compare
…VIDIA-NeMo#1777) Signed-off-by: Jonas Yang <joyang@nvidia.com> Signed-off-by: ruit <ruit@nvidia.com> Signed-off-by: Rayen <ruit@nvidia.com> Signed-off-by: Yuki Huang <yukih@nvidia.com> Signed-off-by: NeMo Bot <nemo-bot@nvidia.com> Co-authored-by: Jonas Yang CN <joyang@nvidia.com> Co-authored-by: ruit <ruit@nvidia.com> Co-authored-by: Yuki Huang <yukih@nvidia.com> Co-authored-by: jgerh <163925524+jgerh@users.noreply.github.com>
beep boop [🤖]: Hi @joyang-nv 👋,
Summary by CodeRabbit
✏️ Tip: You can customize this high-level summary in your review settings.