feat: add pipeline parallelism support for knowledge distillation#1500
Conversation
Add PP support to `KnowledgeDistillationRecipeForNextTokenPrediction`:

- Add `_build_teacher_model_with_pp`: builds the teacher as an AutoPipeline mirroring the student's PP config, with a capture closure that stores last-stage logits in `_teacher_logits_capture` after each eval pass.
- Add `_make_pp_kd_loss_wrapper`: injects into the student schedule's `_loss_fn`; reads `_current_teacher_logits` set by the teacher eval pass and returns `(1 - ratio) * ce + ratio * kd`.
- Add `_forward_backward_step_pp`: runs the teacher eval first to capture logits, then runs the student step/eval.
- Add `_run_train_optim_step_pp`: full PP training step with grad accumulation, norm clipping, and cross-rank loss aggregation for logging.
- Add a PP-aware `run_train_validation_loop` override (validation skipped).
- `setup()` wires up all of the above and removes the old ValueError.

Bug fixes applied from the reference implementation:

- Fix the corrupted `has_packed_sequence` kwarg in the teacher PP builder.
- Guard `_forward_backward_step` against being called when PP is enabled.
- Skip the CE computation when `kd_ratio >= 1.0` (avoids a wasted forward pass).
- Add the missing `metric_logger_train.log(log_data)` call in `log_train_metrics`.
- Conditionalize the `ce_loss` display in log strings on `kd_ratio < 1.0`.

Known limitation: when `pp_microbatch_size < pp_batch_size`, only the last microbatch's teacher logits are retained; set `pp_microbatch_size == pp_batch_size` when using PP with KD.

Adds 7 unit tests covering PP-specific logic (capture closure, wrapper combination math, `kd_ratio` edge cases, buffer accumulation).
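The loss-combination rule described above, including the `kd_ratio >= 1.0` short-circuit that skips the CE term, can be sketched roughly as follows. This is an illustrative assumption, not the recipe's actual code: the helper name `combine_kd_loss` is invented here, and forward KL is just one common choice for the KD term.

```python
import torch
import torch.nn.functional as F


def combine_kd_loss(student_logits: torch.Tensor,
                    teacher_logits: torch.Tensor,
                    labels: torch.Tensor,
                    kd_ratio: float) -> torch.Tensor:
    """Sketch of (1 - kd_ratio) * ce + kd_ratio * kd (names assumed).

    KD term: forward KL between teacher and student token distributions.
    """
    kd = F.kl_div(
        F.log_softmax(student_logits, dim=-1),
        F.log_softmax(teacher_logits, dim=-1),
        log_target=True,          # teacher distribution passed in log-space
        reduction="batchmean",
    )
    if kd_ratio >= 1.0:
        # Pure distillation: skip the CE computation entirely,
        # mirroring the "avoid wasted forward pass" fix above.
        return kd
    ce = F.cross_entropy(
        student_logits.view(-1, student_logits.size(-1)),
        labels.view(-1),
    )
    return (1.0 - kd_ratio) * ce + kd_ratio * kd
```

With `kd_ratio == 0.0` this degenerates to plain cross-entropy, and with identical teacher and student logits the KD term vanishes, which is the kind of edge-case math the PR's unit tests exercise.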
@akoumpa for visibility |
/ok to test 134a577 |
/ok to test 2d773fc |
/ok to test e6da11a |
/claude review |
Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>
/claude review |
```python
ce_tensor = (
    torch.stack(self._ce_loss_buffer).sum()
    if self._ce_loss_buffer
    else torch.tensor(0.0, device=self.dist_env.device)
)
kd_tensor = (
    torch.stack(self._kd_loss_buffer).sum()
    if self._kd_loss_buffer
    else torch.tensor(0.0, device=self.dist_env.device)
)
ce_tensor = self._dp_allreduce(ce_tensor, include_cp=True)
kd_tensor = self._dp_allreduce(kd_tensor, include_cp=True)
ce_tensor = ce_tensor.to(self.dist_env.device)
kd_tensor = kd_tensor.to(self.dist_env.device)
if self.dist_env.rank == src_rank and not self.dist_env.is_main:
    torch.distributed.send(ce_tensor, dst=0)
    torch.distributed.send(kd_tensor, dst=0)
elif self.dist_env.is_main and self.dist_env.rank != src_rank:
    torch.distributed.recv(ce_tensor, src=src_rank)
    torch.distributed.recv(kd_tensor, src=src_rank)
ce_loss = ce_tensor.cpu().item()
kd_loss = kd_tensor.cpu().item()
```
The same normalization mismatch applies to the logged KD/CE metrics in the PP path. Since `ce_loss` is a sum and `kd_loss` is a mean (before the fix above), the values in `_ce_loss_buffer` and `_kd_loss_buffer` are on different scales. Additionally, unlike `reporting_loss`, which is divided by `num_label_tokens` on line 683, `ce_loss` and `kd_loss` here are not normalized at all; they are raw allreduced values.

In the non-PP path (lines 586-587), both buffers contain per-token-normalized values (since `num_label_tokens` is passed to the loss functions). So the PP and non-PP paths log `ce_loss`/`kd_loss` on different scales, making them incomparable across runs with different parallelism configs.

After fixing `num_batch_labels=1` above, both buffers will contain sums, and you'd want to divide them by `num_label_tokens` here (similar to line 683 for `reporting_loss`).
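A minimal sketch of the normalization this comment suggests, assuming both buffers hold raw allreduced sums at this point; the helper name `normalize_logged_losses` is hypothetical:

```python
import torch


def normalize_logged_losses(ce_sum: torch.Tensor,
                            kd_sum: torch.Tensor,
                            num_label_tokens: int) -> tuple:
    """Divide allreduced loss sums by the global label-token count
    (mirroring how reporting_loss is handled), so PP and non-PP runs
    log ce_loss/kd_loss on the same per-token scale."""
    denom = max(num_label_tokens, 1)  # guard against an empty batch
    return (ce_sum / denom).item(), (kd_sum / denom).item()
```

The `max(..., 1)` guard is only there so a degenerate batch with zero label tokens does not divide by zero; the normal path divides by the true token count.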
Co-authored-by: claude[bot] <209825114+claude[bot]@users.noreply.github.com>
/ok to test f94012a |