Fix IndexError in finetune scripts when last logit chunk becomes empty #2141
Conversation
Co-authored-by: Borda <6035284+Borda@users.noreply.github.com>
@copilot try to split the test into multiple smaller ones, each focused only on a single case
Pull request overview
Fixes an edge case in the finetuning training loop when using lm_head_chunk_size=128: after shifting logits to align with next-token targets, the final logits chunk can become empty (when the last chunk has length 1), which can break downstream loss computation.
Changes:
- Drop an empty final logits chunk after the shift step in all affected finetune scripts (see the sketch after this list).
- Add tests covering the “empty last chunk after shift” scenario.
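For context, here is a minimal sketch of the affected pattern. The helper name training_step_loss is hypothetical, and the real scripts handle batching and masking differently, but the shift and the new empty-chunk drop follow this shape:

```python
import torch

from litgpt.utils import chunked_cross_entropy


def training_step_loss(model, input_ids: torch.Tensor, targets: torch.Tensor) -> torch.Tensor:
    """Hypothetical sketch: shift logits, drop an empty final chunk, compute the loss."""
    # With lm_head_chunk_size=128 the model returns the logits as a list of chunks.
    logits = model(input_ids, lm_head_chunk_size=128)
    # Shift the final chunk so each position predicts the *next* token.
    logits[-1] = logits[-1][..., :-1, :]
    # The fix: a final chunk of length 1 becomes empty after the shift and would
    # break chunked_cross_entropy, so drop it before computing the loss.
    if logits[-1].size(1) == 0:
        logits = logits[:-1]
    return chunked_cross_entropy(logits, targets[..., 1:])
```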
Reviewed changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| litgpt/finetune/lora.py | Removes empty final logits chunk after shift before loss computation. |
| litgpt/finetune/lora_legacy.py | Same empty-chunk removal in legacy LoRA finetune loop. |
| litgpt/finetune/adapter.py | Same empty-chunk removal in Adapter finetune loop. |
| litgpt/finetune/adapter_v2.py | Same empty-chunk removal in Adapter v2 finetune loop. |
| extensions/xla/finetune/adapter.py | Same empty-chunk removal for XLA training loop. |
| tests/test_utils.py | Adds tests for loss correctness when the last logits chunk becomes empty after shift. |
```python
def test_chunked_cross_entropy_chunking_and_shift_T129():
    """Test chunking and shift logic for T=129, ensuring last chunk becomes empty."""
    B, V = 2, 100
    lm_head_chunk_size = 128
    T = 129
    regular_logits = torch.randn(B, T, V)
    targets = torch.randint(0, V, (B, T))

    # Simulate chunking
    chunked_logits = list(regular_logits.split(lm_head_chunk_size, dim=1))
    assert len(chunked_logits) == 2
    assert chunked_logits[-1].size(1) == 1

    # Apply shift
    chunked_logits[-1] = chunked_logits[-1][..., :-1, :]
    assert chunked_logits[-1].size(1) == 0


def test_chunked_cross_entropy_empty_removal_T129():
    """Test empty chunk removal for T=129, resulting in a single remaining chunk."""
    B, V = 2, 100
    lm_head_chunk_size = 128
    T = 129
    regular_logits = torch.randn(B, T, V)

    chunked_logits = list(regular_logits.split(lm_head_chunk_size, dim=1))
    chunked_logits[-1] = chunked_logits[-1][..., :-1, :]  # Shift to make empty

    # Apply removal
    if chunked_logits[-1].size(1) == 0:
        chunked_logits = chunked_logits[:-1]
    assert len(chunked_logits) == 1


def test_chunked_cross_entropy_loss_computation_T129():
    """Test loss computation for T=129 after empty chunk removal."""
    B, V = 2, 100
    lm_head_chunk_size = 128
    T = 129
    regular_logits = torch.randn(B, T, V)
    targets = torch.randint(0, V, (B, T))

    chunked_logits = list(regular_logits.split(lm_head_chunk_size, dim=1))
    chunked_logits[-1] = chunked_logits[-1][..., :-1, :]
    if chunked_logits[-1].size(1) == 0:
        chunked_logits = chunked_logits[:-1]
```
The block of tests starting here (T129/T257 chunking/shift/empty_removal/loss/baseline comparisons) duplicates the coverage already added earlier in this file by test_chunked_cross_entropy_with_empty_last_chunk_*. This adds a lot of redundant runtime and maintenance overhead; please remove these duplicates or collapse into a single parametrized test for T={129,257}.
Suggested change:

```python
@pytest.mark.parametrize("T", (129, 257))
def test_chunked_cross_entropy_empty_last_chunk_chunking_shift_and_loss(T):
    """Test empty-last-chunk handling after shift for sequence lengths just over a chunk boundary."""
    B, V = 2, 100
    lm_head_chunk_size = 128
    regular_logits = torch.randn(B, T, V)
    targets = torch.randint(0, V, (B, T))
    chunked_logits = list(regular_logits.split(lm_head_chunk_size, dim=1))
    assert len(chunked_logits) == (T + lm_head_chunk_size - 1) // lm_head_chunk_size
    assert chunked_logits[-1].size(1) == 1
    chunked_logits[-1] = chunked_logits[-1][..., :-1, :]
    assert chunked_logits[-1].size(1) == 0
    if chunked_logits[-1].size(1) == 0:
        chunked_logits = chunked_logits[:-1]
    assert len(chunked_logits) == T // lm_head_chunk_size
```
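If the duplicates are collapsed this way, the remaining cases can be run selectively with pytest's `-k` filter, for example `pytest tests/test_utils.py -k chunked_cross_entropy`.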
Problem
Users encountered an IndexError when running finetune_lora and other finetune scripts on Gemma models (and potentially other models with certain sequence lengths). The error occurred during training when processing batches with specific sequence lengths.

Root Cause

When lm_head_chunk_size=128 is used in the model forward pass, logits are returned as a list of chunks. The finetune code applies a shift operation to align predictions with targets.
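As a toy illustration of that shift (shapes only, not the script's actual code), a final chunk of length 1 ends up with length 0:

```python
import torch

last_chunk = torch.randn(2, 1, 100)  # (batch, seq_len=1, vocab): final chunk of a T=129 sequence
shifted = last_chunk[..., :-1, :]    # drop the last position to align with next-token targets
print(shifted.shape)                 # torch.Size([2, 0, 100]) -- empty along the sequence dim
```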
The bug: when the last chunk has a sequence length of exactly 1, the slicing operation [..., :-1, :] creates a chunk with length 0. This empty chunk then causes chunked_cross_entropy to fail because PyTorch's split() function doesn't accept a split size of 0. This occurs when the total sequence length is of the form 128*n + 1 (e.g., 1, 129, 257, 385, etc.).

Solution
Added a simple check after the shift operation to remove empty chunks:
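A sketch of that check, mirroring the logic exercised by the new tests (the variable holding the chunk list may be named differently in each script):

```python
# After the shift, the final chunk may have sequence length 0; drop it so every
# chunk passed to chunked_cross_entropy has a non-zero length.
if logits[-1].size(1) == 0:
    logits = logits[:-1]
```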
This ensures all chunks passed to the loss function have non-zero sequence length.
Changes
- litgpt/finetune/lora.py
- litgpt/finetune/adapter.py
- litgpt/finetune/adapter_v2.py
- litgpt/finetune/lora_legacy.py
- extensions/xla/finetune/adapter.py
- test_chunked_cross_entropy_with_empty_last_chunk() to validate the fix

Impact
Fixes #[issue_number]
Warning

Firewall rules blocked me from connecting to one or more addresses. I tried to connect to the following addresses, but was blocked by firewall rules:

- huggingface.co
- python3 (dns block)

If you need me to access, download, or install something from one of these locations, you can either:
Original prompt
This section details the original issue you should resolve.
<issue_title>finetune_lora on gemma bug</issue_title>
<issue_description>

### Bug description
I am trying to use finetune_lora to do PEFT on a Gemma model, and I have tried:
Both encounter IndexError. I have also tried other model series like QwQ and Llama, etc.; all look fine.
It seems some people have met a similar bug (but on gemma-7b); I'm not sure whether it's the same problem.
What operating system are you using?
Linux
LitGPT Version
litgpt 0.5.7 & litgpt 0.5.8.dev1