
v2.1.27#686

Merged
joellidin merged 6 commits into main from dev
Jan 18, 2026

Conversation

@joellidin
Collaborator

@joellidin joellidin commented Jan 18, 2026

  • (neurons) Skip batches with all masked labels
  • (neurons) Switch anneal mode to shard 5
  • Bump run version
  • (hparams) Update anneal scheduler

Description

Related Issue(s)

  • Closes #[issue number]

Type of Change

  • Feature (adding new functionality)
  • Fix (resolving a bug or issue)
  • Docs (documentation updates)
  • Refactor (code changes that don't affect functionality)
  • Maintenance (dependency updates or other maintenance)
  • Tests (adding or improving tests)
  • Breaking change (fix or feature with incompatible API changes)
  • Other: _____

Branch Naming

  • My branch follows the project's naming convention (e.g., feature/add-new-capability)

Commit Messages

  • My commits are small, atomic, and have proper commit messages
  • Commit messages are in imperative mood with a capitalized summary under 50 chars

Code Quality

  • I've performed a self-review of my code
  • I've added appropriate docstrings following the project's conventions
  • I've added proper logging where necessary (without trailing periods)
  • I've applied linting and formatting with Ruff
  • My code generates no new warnings

Testing

  • I've added tests for new functionality or bug fixes
  • All tests pass locally with my changes
  • Test coverage has not decreased

Documentation

  • I've updated documentation to reflect my changes
  • I've updated comments in hard-to-understand areas

If this is a breaking change

Screenshots/Examples

Additional Notes

Summary by CodeRabbit

  • New Features

    • Added a runtime guard to skip batches that contain no valid labels, preventing wasted work and noisy errors during distributed training.
  • Chores

    • Version bumped to 2.1.27.
    • Switched anneal-mode shard target from 4 to 5.
    • Extended anneal decay outer steps (120 → 150).
  • Documentation

    • Updated dataset migration example commands to reference the new shard.


Add guards in evaluate_model and inner_steps to prevent NaN loss when
all labels in a batch are masked (-100). This occurs when batches
contain only padding or special tokens.

- Check valid_labels count before forward pass
- Log warning and skip batch if valid_labels == 0
- Clean up tensors before continuing to next batch
- Prevent cross_entropy from receiving an empty loss target
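The guard described above can be sketched as follows. `masked_batch_loss` is a hypothetical helper, not the PR's actual code in neurons/trainer.py; it only illustrates why the check is needed: with `ignore_index=-100`, cross-entropy's mean reduction divides by the number of valid targets, so an all-masked batch yields 0/0 = NaN.

```python
import torch
import torch.nn.functional as F


def masked_batch_loss(logits: torch.Tensor, labels: torch.Tensor):
    """Return the cross-entropy loss, or None when every label is masked.

    With ignore_index=-100, F.cross_entropy averages over the valid
    targets only; if none exist, the mean is 0/0 and the loss is NaN.
    """
    valid_labels = int((labels != -100).sum())
    if valid_labels == 0:
        return None  # caller logs a warning, frees tensors, and skips the batch
    return F.cross_entropy(
        logits.view(-1, logits.size(-1)),  # flatten to (tokens, vocab)
        labels.view(-1),
        ignore_index=-100,
    )
```

A caller checking for `None` before the backward pass is what "skip batch if valid_labels == 0" amounts to in this sketch.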

Update miner and validator to use anneal shard 5 instead of shard 4.
Update documentation to reflect the new shard number in rclone migration
examples.

- Change current_shard from 4 to 5 in miner.py
- Change current_shard from 4 to 5 in validator.py
- Update docs with anneal_000005.npy examples

Change decay_outer_steps from 120 to 150 to mitigate, for now, always gathering
the full 20 peers.
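For reference, the scheduler change presumably corresponds to a fragment of hparams/hparams.json like the following; the field names come from the walkthrough (anneal_mode.decay_outer_steps), but the surrounding structure is assumed:

```json
{
  "anneal_mode": {
    "decay_outer_steps": 150
  }
}
```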
@coderabbitai

coderabbitai bot commented Jan 18, 2026

Walkthrough

Updates switch anneal-mode shard selection from 4→5 in runtime and docs, extend anneal decay_outer_steps 120→150, add distributed batch-skip guard for all-labels-masked batches in trainer, and bump package version to 2.1.27.

Changes

Cohort / File(s) Summary
Anneal Mode Shard Configuration
neurons/miner.py, neurons/validator.py
Changed hard-coded anneal-mode initial shard from 4 to 5 and reset shard_epoch to 0 during shard initialization.
Hyperparameter & Documentation Updates
hparams/hparams.json, docs/shared_sharded_dataset.md
Updated anneal_mode.decay_outer_steps from 120 to 150; docs examples changed to reference shard 5 (file path adjustments).
Training Safety Guards
neurons/trainer.py
Added distributed check to detect batches with all labels masked (-100) and skip such batches across ranks with warning logging and input cleanup. Implemented in evaluate_model and inner_steps paths.
Version Bump
src/tplr/__init__.py
Bumped __version__ from "2.1.26" to "2.1.27".

Sequence Diagram(s)

sequenceDiagram
  participant Rank as Worker Rank
  participant All as Collective (all_ok)
  participant Master as Master / Logger

  Rank->>Rank: prepare labels\n(has_valid_labels = (labels != -100).any())
  Rank->>All: all_ok(has_valid_labels)
  All-->>Rank: all_ok_result (bool)
  alt all_ok_result == true
    Rank->>Rank: proceed with forward/backward
  else all_ok_result == false
    Rank->>Master: (only master) log warning "dropping masked batch"
    Rank->>Rank: delete input_ids, labels\ncontinue (skip batch)
  end

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

  • fix/nan loss #685: Implements the same combined changes—batch-skip guard, shard 4→5 updates, anneal decay tweak, docs edits, and version bump.
  • v2.1.12 #638: Related changes to shard-handling logic in neurons/miner.py / neurons/validator.py (adds last_shard-based swap logic).
  • feat/new anneal shard #683: Modifies anneal-mode shard selection and related docs/hparams (different target shard in that PR).

Suggested reviewers

  • shivam-MBZUAI
  • amiiir-sarfi

Poem

🐰 Hoppity-hop, shard five in sight,

Masked batches skipped till morning light,
Decay stretched out, the schedule sings,
Version raised — the rabbit springs! 🎉

🚥 Pre-merge checks | ❌ 3 failed (2 warnings, 1 inconclusive)

  • Description check — ⚠️ Warning: the pull request description contains only a bullet-point list of changes, without the required template sections (Description, Type of Change) or any narrative explanation. Resolution: complete the template with a Description covering the what and why, select the appropriate Type of Change checkboxes, and fill in the other relevant sections.
  • Docstring coverage — ⚠️ Warning: docstring coverage is 50.00%, below the required 80.00% threshold. Resolution: write docstrings for the functions missing them.
  • Title check — ❓ Inconclusive: the title "v2.1.27" indicates only a version bump and does not convey the substantive changes (batch skipping, shard switching, scheduler update). Resolution: use a more descriptive title, e.g. "v2.1.27: Add batch masking guards, switch anneal shard, update scheduler".



@codecov

codecov bot commented Jan 18, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.

❌ Your project status has failed because the head coverage (57.69%) is below the target coverage (85.00%). You can increase the head coverage or adjust the target coverage.

Impacted file tree graph

@@           Coverage Diff           @@
##             main     #686   +/-   ##
=======================================
  Coverage   57.69%   57.69%           
=======================================
  Files          27       27           
  Lines        4990     4990           
=======================================
  Hits         2879     2879           
  Misses       2111     2111           
Files with missing lines:
  • src/tplr/__init__.py — 100.00% <100.00%> (ø)

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🤖 Fix all issues with AI agents
In `@neurons/trainer.py`:
- Around lines 854-862: the batch-skip decision is currently local (it uses
  valid_labels and batch_count), which can break collective ops later (e.g.,
  ddp_reduce). Make the skip decision collective: compute a tensor flag such as
  local_has_valid = torch.tensor(1 if (labels != -100).any() else 0,
  device=labels.device), all-reduce it across the process group (e.g.,
  torch.distributed.all_reduce(local_has_valid, op=torch.distributed.ReduceOp.SUM)),
  and derive a single shared boolean (any_rank_has_valid or
  all_ranks_have_no_valid). Use that shared result either to skip the batch on
  every rank (delete input_ids and labels, then continue) or to proceed with the
  forward/backward pass and ddp_reduce. Update the code paths around
  valid_labels, batch_count, input_ids, labels, and any subsequent ddp_reduce to
  rely on this collective decision.

Make batch-skip decision collective across all ranks to prevent DDP
collective operation failures. Previously, each rank independently
decided to skip batches with all-masked labels, which could cause some
ranks to skip while others continued, breaking ddp_reduce.

- Use dist_helper.all_ok() to synchronize skip decision
- Apply fix to both evaluate_model and inner training loop
- Skip on all ranks if any rank has all-masked labels
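The pattern behind this fix can be sketched as below. `all_ranks_ok` is a hypothetical stand-in for the PR's `dist_helper.all_ok()` (whose real signature isn't shown in this thread); a MIN all-reduce makes every rank agree to skip as soon as any rank sees an all-masked batch, so subsequent DDP collectives stay aligned.

```python
import torch
import torch.distributed as dist


def all_ranks_ok(labels: torch.Tensor) -> bool:
    """True only if every rank's batch has at least one valid label.

    Every rank must call this at the same point in the loop: the
    all-reduce is itself a collective op, which is what keeps the skip
    decision (and any later ddp_reduce calls) aligned across ranks.
    """
    local_ok = torch.tensor(
        1.0 if bool((labels != -100).any()) else 0.0, device=labels.device
    )
    if dist.is_available() and dist.is_initialized():
        # MIN reduction: the result stays 1 only if every rank reported 1
        dist.all_reduce(local_ok, op=dist.ReduceOp.MIN)
    return bool(local_ok.item())
```

In the training loop, each rank would then do something like `if not all_ranks_ok(labels): del input_ids, labels; continue`, with only the master rank logging the warning. Outside a process group (or with world size 1) the MIN reduce is a no-op, so the function degrades to the local check.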
@joellidin joellidin merged commit 8dafcd2 into main Jan 18, 2026
7 of 8 checks passed