# Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

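"""Tensor-parallel-aware knowledge distillation (KD) loss on precomputed logits.

Works on dense tensors as well as DTensor logits sharded over the vocabulary
dimension, reducing softmax statistics across the tensor-parallel group.
"""
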
import torch
import torch.distributed as dist
import torch.nn as nn
import torch.nn.functional as F
from torch.distributed.tensor import DTensor, Shard


def _infer_tp_group_from_dtensor(tensor: torch.Tensor):
    """Return the DTensor's device-mesh process group if the tensor is sharded
    (e.g. vocab-sharded logits on the last dim, or an lm_head weight on dim 0);
    otherwise return None."""
    if not isinstance(tensor, DTensor):
        return None
    # Vocab sharding: Shard on the last dim (logits) or Shard(0) (weight matrix).
    has_shard = any(isinstance(p, Shard) for p in tensor.placements)
    if not has_shard:
        return None
    return tensor.device_mesh.get_group()


def _kl_forward_tp(
    t_logits: torch.Tensor,
    s_logits: torch.Tensor,
    tp_group,
) -> torch.Tensor:
    """
    Compute the per-token sum over the vocabulary of P_teacher * log Q_student
    (i.e. the negative cross entropy) when the vocab dimension is sharded across
    a tensor-parallel group. Max/sum statistics are all-reduced over ``tp_group``.
    """
    # Numerically stable teacher softmax over the sharded vocab dimension.
    teacher_max = t_logits.max(dim=-1, keepdim=True).values
    dist.all_reduce(teacher_max, op=dist.ReduceOp.MAX, group=tp_group)
    output_teacher = t_logits - teacher_max

    denom_teacher = torch.exp(output_teacher).sum(dim=-1, keepdim=True)
    dist.all_reduce(denom_teacher, op=dist.ReduceOp.SUM, group=tp_group)
    teacher_prob = torch.exp(output_teacher - torch.log(denom_teacher.clamp(min=1e-12)))

    # Student log-softmax; the max is detached so it acts only as a numerical shift.
    student_max = s_logits.max(dim=-1, keepdim=True).values
    dist.all_reduce(student_max, op=dist.ReduceOp.MAX, group=tp_group)
    output_student = s_logits - student_max.detach()

    denom_student = torch.exp(output_student).sum(dim=-1, keepdim=True)
    dist.all_reduce(denom_student, op=dist.ReduceOp.SUM, group=tp_group)
    student_log_prob = output_student - torch.log(denom_student.clamp(min=1e-12))

    # Local vocab-shard contribution to sum_v P * log Q, then reduce across the group.
    term = teacher_prob * student_log_prob
    inf_mask = torch.isinf(s_logits)
    term = torch.masked_fill(term, inf_mask, 0.0)
    ce_local = term.sum(dim=-1)
    dist.all_reduce(ce_local, op=dist.ReduceOp.SUM, group=tp_group)
    return ce_local.view(-1)


class KDLoss(nn.Module):
    """Tensor-parallel-aware knowledge distillation (KD) loss on precomputed logits."""

    def __init__(
        self,
        ignore_index: int = -100,
        temperature: float = 1.0,
        fp32_upcast: bool = True,
        tp_group=None,
        **kwargs,
    ):
        super().__init__()
        self.ignore_index = ignore_index
        self.temperature = temperature
        self.fp32_upcast = fp32_upcast
        self.tp_group = tp_group

    def forward(
        self,
        student_logits: torch.Tensor,
        teacher_logits: torch.Tensor,
        labels: torch.Tensor,
        num_batch_labels: int | None = None,
    ) -> torch.Tensor:
        valid_mask = (labels != self.ignore_index).view(-1)
        if valid_mask.sum() == 0:
            return student_logits.new_tensor(0.0)

        # Flatten (batch, seq, vocab) -> (tokens, vocab) and (batch, seq) -> (tokens,).
        if student_logits.ndim > 2:
            student_logits = student_logits.view(-1, student_logits.shape[-1])
        if teacher_logits.ndim > 2:
            teacher_logits = teacher_logits.view(-1, teacher_logits.shape[-1])
        if labels.ndim > 1:
            labels = labels.view(-1)

        tp_group = self.tp_group
        if isinstance(student_logits, DTensor) and tp_group is None:
            tp_group = _infer_tp_group_from_dtensor(student_logits)

        if tp_group is not None:
            # Keep the local vocab shard; reductions happen inside _kl_forward_tp.
            if isinstance(student_logits, DTensor):
                student_logits = student_logits.to_local()
            if isinstance(teacher_logits, DTensor):
                teacher_logits = teacher_logits.to_local()
        else:
            # No TP group available: gather the full vocab dimension on every rank.
            if isinstance(student_logits, DTensor):
                student_logits = student_logits.full_tensor()
            if isinstance(teacher_logits, DTensor):
                teacher_logits = teacher_logits.full_tensor()

        t_logits = teacher_logits[valid_mask]
        s_logits = student_logits[valid_mask]

        if self.fp32_upcast:
            t_logits = t_logits.float()
            s_logits = s_logits.float()
        if self.temperature != 1.0:
            t_logits = t_logits.mul(1.0 / self.temperature)
            s_logits = s_logits.mul(1.0 / self.temperature)

        if tp_group is not None:
            kl_per_token = _kl_forward_tp(t_logits, s_logits, tp_group)
        else:
            teacher_prob = F.softmax(t_logits, dim=-1, dtype=torch.float32)
            student_logprob = F.log_softmax(s_logits, dim=-1, dtype=torch.float32)
            inf_mask = torch.isinf(s_logits)
            kl_per_token = (
                torch.masked_fill(teacher_prob * student_logprob, inf_mask, 0.0).sum(-1).view(-1)
            )

        # Standard temperature scaling so gradient magnitudes match the T=1 case.
        if self.temperature != 1.0:
            kl_per_token = kl_per_token * (self.temperature**2)

        # Negate: kl_per_token holds sum_v P * log Q, so the returned loss is the cross entropy.
        if num_batch_labels is not None:
            return -torch.sum(kl_per_token) / num_batch_labels
        return -torch.mean(kl_per_token)
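

# Minimal single-process usage sketch (illustrative only, not part of the library API):
# without tensor parallelism the dense softmax / log-softmax branch is exercised, so no
# process group or DTensor setup is required. Shapes and values below are arbitrary.
if __name__ == "__main__":
    torch.manual_seed(0)
    batch, seq, vocab = 2, 5, 32
    student = torch.randn(batch, seq, vocab, requires_grad=True)
    teacher = torch.randn(batch, seq, vocab)
    labels = torch.randint(0, vocab, (batch, seq))
    labels[0, :2] = -100  # positions with ignore_index are excluded from the loss

    loss_fn = KDLoss(temperature=2.0)
    loss = loss_fn(student, teacher, labels)
    loss.backward()
    print(f"KD loss: {loss.item():.4f}, grad norm: {student.grad.norm().item():.4f}")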