
v2.1.27#686

Merged
joellidin merged 6 commits into main from dev
Jan 18, 2026

Conversation

@joellidin
Collaborator

@joellidin joellidin commented Jan 18, 2026

  • (neurons) Skip batches with all masked labels
  • (neurons) Switch anneal mode to shard 5
  • Bump run version
  • (hparams) Update anneal scheduler

Description

Related Issue(s)

  • Closes #[issue number]

Type of Change

  • Feature (adding new functionality)
  • Fix (resolving a bug or issue)
  • Docs (documentation updates)
  • Refactor (code changes that don't affect functionality)
  • Maintenance (dependency updates or other maintenance)
  • Tests (adding or improving tests)
  • Breaking change (fix or feature with incompatible API changes)
  • Other: _____

Branch Naming

  • My branch follows the project's naming convention (e.g., feature/add-new-capability)

Commit Messages

  • My commits are small, atomic, and have proper commit messages
  • Commit messages are in imperative mood with a capitalized summary under 50 chars

Code Quality

  • I've performed a self-review of my code
  • I've added appropriate docstrings following the project's conventions
  • I've added proper logging where necessary (without trailing periods)
  • I've applied linting and formatting with Ruff
  • My code generates no new warnings

Testing

  • I've added tests for new functionality or bug fixes
  • All tests pass locally with my changes
  • Test coverage has not decreased

Documentation

  • I've updated documentation to reflect my changes
  • I've updated comments in hard-to-understand areas

If this is a breaking change

Screenshots/Examples

Additional Notes

Summary by CodeRabbit

  • New Features

    • Added a runtime guard to skip batches that contain no valid labels, preventing wasted work and noisy errors during distributed training.
  • Chores

    • Version bumped to 2.1.27.
    • Switched anneal-mode shard target from 4 to 5.
    • Extended anneal decay outer steps (120 → 150).
  • Documentation

    • Updated dataset migration example commands to reference the new shard.


Add guards in evaluate_model and inner_steps to prevent NaN loss when
all labels in a batch are masked (-100). This occurs when batches
contain only padding or special tokens.

- Check valid_labels count before forward pass
- Log warning and skip batch if valid_labels == 0
- Clean up tensors before continuing to next batch
- Prevent cross_entropy from receiving an empty loss target
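The guard described above can be sketched as follows. `masked_batch_loss` is a hypothetical helper, not the PR's actual code in neurons/trainer.py; it only illustrates why the check is needed: with `ignore_index=-100`, cross-entropy's mean reduction divides by the number of valid targets, so an all-masked batch yields 0/0 = NaN.

```python
import torch
import torch.nn.functional as F


def masked_batch_loss(logits: torch.Tensor, labels: torch.Tensor):
    """Return the cross-entropy loss, or None when every label is masked.

    With ignore_index=-100, F.cross_entropy averages over the valid
    targets only; if none exist, the mean is 0/0 and the loss is NaN.
    """
    valid_labels = int((labels != -100).sum())
    if valid_labels == 0:
        return None  # caller logs a warning, frees tensors, and skips the batch
    return F.cross_entropy(
        logits.view(-1, logits.size(-1)),  # flatten to (tokens, vocab)
        labels.view(-1),
        ignore_index=-100,
    )
```

A caller checking for `None` before the backward pass is what "skip batch if valid_labels == 0" amounts to in this sketch.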

Update miner and validator to use anneal shard 5 instead of shard 4.
Update documentation to reflect the new shard number in rclone migration
examples.

- Change current_shard from 4 to 5 in miner.py
- Change current_shard from 4 to 5 in validator.py
- Update docs with anneal_000005.npy examples

Change decay_outer_steps from 120 to 150 to mitigate, for now, always gathering
the full 20 peers.
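For reference, the scheduler change presumably corresponds to a fragment of hparams/hparams.json like the following; the field names come from the walkthrough (anneal_mode.decay_outer_steps), but the surrounding structure is assumed:

```json
{
  "anneal_mode": {
    "decay_outer_steps": 150
  }
}
```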
@coderabbitai

coderabbitai bot commented Jan 18, 2026

Walkthrough

Updates switch anneal-mode shard selection from 4→5 in runtime and docs, extend anneal decay_outer_steps 120→150, add distributed batch-skip guard for all-labels-masked batches in trainer, and bump package version to 2.1.27.

Changes

Cohort / File(s) Summary
Anneal Mode Shard Configuration
neurons/miner.py, neurons/validator.py
Changed hard-coded anneal-mode initial shard from 4 to 5 and reset shard_epoch to 0 during shard initialization.
Hyperparameter & Documentation Updates
hparams/hparams.json, docs/shared_sharded_dataset.md
Updated anneal_mode.decay_outer_steps from 120 to 150; docs examples changed to reference shard 5 (file path adjustments).
Training Safety Guards
neurons/trainer.py
Added distributed check to detect batches with all labels masked (-100) and skip such batches across ranks with warning logging and input cleanup. Implemented in evaluate_model and inner_steps paths.
Version Bump
src/tplr/__init__.py
Bumped __version__ from "2.1.26" to "2.1.27".

Sequence Diagram(s)

sequenceDiagram
  participant Rank as Worker Rank
  participant All as Collective (all_ok)
  participant Master as Master / Logger

  Rank->>Rank: prepare labels\n(has_valid_labels = (labels != -100).any())
  Rank->>All: all_ok(has_valid_labels)
  All-->>Rank: all_ok_result (bool)
  alt all_ok_result == true
    Rank->>Rank: proceed with forward/backward
  else all_ok_result == false
    Rank->>Master: (only master) log warning "dropping masked batch"
    Rank->>Rank: delete input_ids, labels\ncontinue (skip batch)
  end

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

  • fix/nan loss #685: Implements the same combined changes—batch-skip guard, shard 4→5 updates, anneal decay tweak, docs edits, and version bump.
  • v2.1.12 #638: Related changes to shard-handling logic in neurons/miner.py / neurons/validator.py (adds last_shard-based swap logic).
  • feat/new anneal shard #683: Modifies anneal-mode shard selection and related docs/hparams (different target shard in that PR).

Suggested reviewers

  • shivam-MBZUAI
  • amiiir-sarfi

Poem

🐰 Hoppity-hop, shard five in sight,

Masked batches skipped till morning light,
Decay stretched out, the schedule sings,
Version raised — the rabbit springs! 🎉

🚥 Pre-merge checks | ❌ 3 failed (2 warnings, 1 inconclusive)

  • Description check — ⚠️ Warning: the pull request description contains only a bullet-point list of changes, without the required template sections (Description, Type of Change) or any narrative explanation. Resolution: complete the template with a Description covering the what and why, select the appropriate Type of Change checkboxes, and fill in the other relevant sections.
  • Docstring coverage — ⚠️ Warning: docstring coverage is 50.00%, below the required 80.00% threshold. Resolution: write docstrings for the functions missing them.
  • Title check — ❓ Inconclusive: the title "v2.1.27" indicates only a version bump and does not convey the substantive changes (batch skipping, shard switching, scheduler update). Resolution: use a more descriptive title, e.g. "v2.1.27: Add batch masking guards, switch anneal shard, update scheduler".



@codecov

codecov bot commented Jan 18, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.

❌ Your project status has failed because the head coverage (57.69%) is below the target coverage (85.00%). You can increase the head coverage or adjust the target coverage.

Impacted file tree graph

@@           Coverage Diff           @@
##             main     #686   +/-   ##
=======================================
  Coverage   57.69%   57.69%           
=======================================
  Files          27       27           
  Lines        4990     4990           
=======================================
  Hits         2879     2879           
  Misses       2111     2111           
Files with missing lines:
  • src/tplr/__init__.py — 100.00% <100.00%> (ø)

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🤖 Fix all issues with AI agents
In `@neurons/trainer.py`:
- Around lines 854-862: the batch-skip decision is currently local (it uses
  valid_labels and batch_count), which can break collective ops later (e.g.,
  ddp_reduce). Make the skip decision collective: compute a tensor flag such as
  local_has_valid = torch.tensor(1 if (labels != -100).any() else 0,
  device=labels.device), all-reduce it across the process group (e.g.,
  torch.distributed.all_reduce(local_has_valid, op=torch.distributed.ReduceOp.SUM)),
  and derive a single shared boolean (any_rank_has_valid or
  all_ranks_have_no_valid). Use that shared result either to skip the batch on
  every rank (delete input_ids and labels, then continue) or to proceed with the
  forward/backward pass and ddp_reduce. Update the code paths around
  valid_labels, batch_count, input_ids, labels, and any subsequent ddp_reduce to
  rely on this collective decision.

Make batch-skip decision collective across all ranks to prevent DDP
collective operation failures. Previously, each rank independently
decided to skip batches with all-masked labels, which could cause some
ranks to skip while others continued, breaking ddp_reduce.

- Use dist_helper.all_ok() to synchronize skip decision
- Apply fix to both evaluate_model and inner training loop
- Skip on all ranks if any rank has all-masked labels
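The pattern behind this fix can be sketched as below. `all_ranks_ok` is a hypothetical stand-in for the PR's `dist_helper.all_ok()` (whose real signature isn't shown in this thread); a MIN all-reduce makes every rank agree to skip as soon as any rank sees an all-masked batch, so subsequent DDP collectives stay aligned.

```python
import torch
import torch.distributed as dist


def all_ranks_ok(labels: torch.Tensor) -> bool:
    """True only if every rank's batch has at least one valid label.

    Every rank must call this at the same point in the loop: the
    all-reduce is itself a collective op, which is what keeps the skip
    decision (and any later ddp_reduce calls) aligned across ranks.
    """
    local_ok = torch.tensor(
        1.0 if bool((labels != -100).any()) else 0.0, device=labels.device
    )
    if dist.is_available() and dist.is_initialized():
        # MIN reduction: the result stays 1 only if every rank reported 1
        dist.all_reduce(local_ok, op=dist.ReduceOp.MIN)
    return bool(local_ok.item())
```

In the training loop, each rank would then do something like `if not all_ranks_ok(labels): del input_ids, labels; continue`, with only the master rank logging the warning. Outside a process group (or with world size 1) the MIN reduce is a no-op, so the function degrades to the local check.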
@joellidin joellidin merged commit 8dafcd2 into main Jan 18, 2026
7 of 8 checks passed