Conversation
Add guards in evaluate_model and inner_steps to prevent NaN loss when all labels in a batch are masked (-100). This occurs when batches contain only padding or special tokens.
- Check the valid_labels count before the forward pass
- Log a warning and skip the batch if valid_labels == 0
- Clean up tensors before continuing to the next batch
- Prevent cross_entropy from receiving an empty loss target
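The failure mode behind this commit can be reproduced in a few lines: with `reduction='mean'`, `cross_entropy` divides by the number of non-ignored targets, so an all-masked batch yields 0/0 = NaN. A minimal sketch of the guard; `batch_has_valid_labels` is a hypothetical helper name, the actual checks live inline in evaluate_model and inner_steps:

```python
import torch
import torch.nn.functional as F

def batch_has_valid_labels(labels: torch.Tensor) -> bool:
    """True if at least one label is not the ignore index (-100)."""
    return int((labels != -100).sum().item()) > 0

# All-masked batch: every target is the ignore index.
logits = torch.randn(4, 10)                          # (batch, vocab)
labels = torch.full((4,), -100, dtype=torch.long)    # everything masked

# cross_entropy averages over zero valid targets -> NaN loss.
bad_loss = F.cross_entropy(logits, labels, ignore_index=-100)
assert torch.isnan(bad_loss)

# Guarded version: skip the batch instead of propagating NaN.
if batch_has_valid_labels(labels):
    loss = F.cross_entropy(logits, labels, ignore_index=-100)
else:
    loss = None  # caller logs a warning, frees tensors, and continues
```

Without the guard, a single padding-only batch would poison the running loss and any gradients derived from it.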
Update the miner and validator to use anneal shard 5 instead of shard 4, and update the documentation to reflect the new shard number in the rclone migration examples.
- Change current_shard from 4 to 5 in miner.py
- Change current_shard from 4 to 5 in validator.py
- Update docs with anneal_000005.npy examples
Change decay_outer_steps from 120 to 150 to mitigate, for now, always gathering the full 20 peers.
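Taken together, the two commits above amount to a two-value hyperparameter diff. A minimal sketch using the names from the commit messages (`current_shard`) and the walkthrough (`decay_outer_steps`); where these values actually live in the repo's config is an assumption:

```python
# Values from the commit messages; the surrounding config structure is assumed.
current_shard = 5        # was 4: anneal shard files now look like anneal_000005.npy
decay_outer_steps = 150  # was 120: longer decay, avoids always gathering the full 20 peers
```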
Walkthrough
Updates switch anneal-mode shard selection from 4→5 in runtime and docs, extend anneal decay_outer_steps from 120→150, add a distributed batch-skip guard for all-labels-masked batches in the trainer, and bump the package version to 2.1.27.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Rank as Worker Rank
    participant All as Collective (all_ok)
    participant Master as Master / Logger
    Rank->>Rank: prepare labels\n(has_valid_labels = (labels != -100).any())
    Rank->>All: all_ok(has_valid_labels)
    All-->>Rank: all_ok_result (bool)
    alt all_ok_result == true
        Rank->>Rank: proceed with forward/backward
    else all_ok_result == false
        Rank->>Master: (only master) log warning "dropping masked batch"
        Rank->>Rank: delete input_ids, labels\ncontinue (skip batch)
    end
```
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes
🚥 Pre-merge checks: ❌ 3 failed (2 warnings, 1 inconclusive)
Codecov Report
✅ All modified and coverable lines are covered by tests.
❌ Your project status has failed because the head coverage (57.69%) is below the target coverage (85.00%). You can increase the head coverage or adjust the target coverage.

@@ Coverage Diff @@
## main #686 +/- ##
=======================================
Coverage 57.69% 57.69%
=======================================
Files 27 27
Lines 4990 4990
=======================================
Hits 2879 2879
Misses 2111 2111
Actionable comments posted: 1
🤖 Fix all issues with AI agents
In `@neurons/trainer.py`:
- Around lines 854-862: The batch-skip decision is currently local (it uses valid_labels and batch_count), which can break collective ops later (e.g., ddp_reduce). Make the skip decision collective:
  - Compute a tensor flag, e.g. local_has_valid = torch.tensor(1 if (labels != -100).any() else 0, device=labels.device).
  - All-reduce it across the process group (e.g., torch.distributed.all_reduce(local_has_valid, op=torch.distributed.ReduceOp.SUM), or torch.distributed.reduce combined with world_size) to derive a single shared boolean (any_rank_has_valid or all_ranks_have_no_valid).
  - Use that shared result either to skip the batch on every rank (delete input_ids and labels, then continue) or to proceed with the forward/backward pass and ddp_reduce.
  - Update the code paths around valid_labels, batch_count, input_ids, labels, and any subsequent ddp_reduce to rely on this collective decision.
Make the batch-skip decision collective across all ranks to prevent DDP collective-operation failures. Previously, each rank independently decided to skip batches with all-masked labels, which could cause some ranks to skip while others continued, breaking ddp_reduce.
- Use dist_helper.all_ok() to synchronize the skip decision
- Apply the fix to both evaluate_model and the inner training loop
- Skip on all ranks if any rank has all-masked labels
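A minimal sketch of the collective decision. The helper below mirrors what the commit describes for `dist_helper.all_ok()`, assuming it reduces a per-rank boolean with a logical AND (proceed only if every rank has valid labels); the repo's actual signature may differ:

```python
import torch
import torch.distributed as dist

def all_ok(local_ok: bool, device: torch.device) -> bool:
    """Return True only if *every* rank reports ok, so all ranks make the
    same skip/proceed decision and later collectives (ddp_reduce) stay aligned."""
    if not (dist.is_available() and dist.is_initialized()):
        return local_ok  # single-process fallback
    flag = torch.tensor(1 if local_ok else 0, device=device)
    # MIN over 0/1 flags is a logical AND across ranks.
    dist.all_reduce(flag, op=dist.ReduceOp.MIN)
    return bool(flag.item())

# Inside the training loop (sketch; variable names are assumptions):
#     has_valid = bool((labels != -100).any())
#     if not all_ok(has_valid, labels.device):
#         if is_master:
#             logger.warning("dropping masked batch")
#         del input_ids, labels
#         continue
```

The key property is that every rank receives the same reduced flag, so either all ranks skip or all ranks run forward/backward and hit ddp_reduce together.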