fix: TTree-writing bugs (sliced string arrays, nested recarrays, fLen, TGraph y-range, oversized compression blocks) by henryiii · Pull Request #1657 · scikit-hep/uproot5

henryiii · 2026-06-10T19:00:38Z

🤖 AI text below 🤖

This fixes a batch of verified bugs in the TTree-writing path (plus a couple in histogram/TGraph writing and compression). Each was reproduced with a write-then-read round-trip before fixing.

What was broken & fixed

1. Sliced string branches produced unreadable baskets (_cascadetree.py)
The >U0 string path of Tree.extend built big_endian_offsets directly from layout.offsets without handling the case offsets[0] != 0 (the jagged path does normalize this). As a result, writing a sliced string array via mktree/extend produced a basket whose offsets section was malformed.
Reproducer (raised ValueError: When changing to a larger dtype... on read):

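import awkward as ak
import uproot

fn = "sliced_strings.root"  # any writable path
arr = ak.Array(["aaa", "bb", "cccc", "dd", "eeeee"])[2:]  # sliced, so offsets[0] != 0
with uproot.recreate(fn) as f:
    f.mktree("t", {"s": "string"}).extend({"s": arr})
uproot.open(fn)["t/s"].array()  # failed before the fix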

Fix: convert ListArray/RegularArray/ListOffsetArray with to_ListOffsetArray64(True) so offsets start at zero and content is trimmed.
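To make the offsets problem concrete, here is a minimal pure-Python sketch of what "start at zero and trim content" means. It does not use Awkward Array; normalize_offsets is a hypothetical helper written for illustration, not the function uproot calls.

```python
def normalize_offsets(offsets, content):
    """Shift offsets so they start at zero and trim content to the referenced span.

    Hypothetical helper for illustration only.
    """
    start, stop = offsets[0], offsets[-1]
    new_offsets = [o - start for o in offsets]
    return new_offsets, content[start:stop]


# Five strings stored back to back: one flat byte buffer plus offsets.
content = b"aaabbccccddeeeee"
offsets = [0, 3, 5, 9, 11, 16]

# Slicing [2:] keeps the whole content buffer but drops the first two offsets,
# so the remaining offsets no longer start at zero.
sliced_offsets = offsets[2:]

new_offsets, new_content = normalize_offsets(sliced_offsets, content)
strings = [
    new_content[a:b].decode()
    for a, b in zip(new_offsets, new_offsets[1:])
]
```

After normalization the offsets and the content agree again, which is the property a reader of the basket relies on.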

2. Nested structured arrays could not be written (recarray_to_dict)
The loop for subfield_name, subfield in recarray_to_dict(field): iterated over the keys of the returned dict and tried to unpack each key string into two names, raising ValueError: not enough values to unpack. Changed to iterate over .items(). Verified with a nested np.dtype([('a', 'i4'), ('nested', [('x', 'f8'), ('y', 'i2')])]).
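The failure mode is plain Python and can be shown without NumPy. The sketch below uses nested dicts as a stand-in for a structured array; flatten and its dotted key naming are hypothetical and are not claimed to match what recarray_to_dict produces.

```python
nested = {"x": [1.0, 2.0], "y": [3, 4]}

# Iterating a dict yields its keys. Unpacking the one-character key "x"
# into two names fails, which is the error quoted above.
try:
    for sub_name, sub_value in nested:
        pass
except ValueError as err:
    message = str(err)


def flatten(fields):
    """Flatten nested dicts into a single dict (hypothetical naming scheme)."""
    out = {}
    for name, value in fields.items():  # .items() yields (key, value) pairs
        if isinstance(value, dict):
            for sub_name, sub_value in flatten(value).items():
                out[f"{name}.{sub_name}"] = sub_value
        else:
            out[name] = value
    return out


flat = flatten({"a": [10, 20], "nested": nested})
```

Note that a two-character key would unpack without error and silently bind the wrong values, so the bug only surfaces as an exception for some field names.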

3. TLeafC fLen was shared across branches and overwritten per-extend (_cascadetree.py)
fLen (max string size) was stored in tree-level self._metadata["fLen"] and written into every >U0 branch, and overwritten (not max-accumulated) on each extend. Two string branches with max lengths 24 and 3 both ended up with the last basket's value. Now stored per-branch in datum["fLen"] and accumulated with max across extends; biggest-string computed via numpy.diff(offsets).max(). Verified two branches now report independent, correct fLen (25 and 41) accumulated across extends.
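A rough sketch of the per-branch accumulation, in pure Python. The branch dicts and the key name max_len are assumptions made for illustration; how the stored value maps to the fLen written to disk is not shown here.

```python
def longest_string(offsets):
    """Largest gap between consecutive offsets, i.e. the longest string in bytes."""
    return max((b - a for a, b in zip(offsets, offsets[1:])), default=0)


def extend(branch, offsets):
    """Record one basket; keep the maximum seen so far instead of overwriting."""
    branch["max_len"] = max(branch["max_len"], longest_string(offsets))


# One state dict per branch, so branches cannot clobber each other.
long_branch = {"max_len": 0}
short_branch = {"max_len": 0}

extend(long_branch, [0, 24, 30])   # strings of length 24 and 6
extend(short_branch, [0, 3, 4])    # strings of length 3 and 1

# A later basket with shorter strings must not shrink the recorded maximum.
extend(long_branch, [0, 2])
extend(short_branch, [0, 1])
```

With a single shared value that is overwritten on each call, both branches would end up reporting the size from whichever basket was written last.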

4. Wrong variable in duplicate-branch error message — {kk!r} → {k!r} (kk was only bound in the jagged sub-branch above).

5. Avoid an unnecessary copy — flat-branch path uses astype(datum["dtype"], copy=False).

6. numpy.linspace dtype fed into endpoint (identify.py, 7 sites)
numpy.linspace(fXmin, fXmax, len(edges), edges.dtype) passed the dtype object into the positional endpoint parameter (truthy, so it worked by accident, but the dtype was ignored). These only feed numpy.allclose regularity checks, so the argument was dropped. Verified regular/irregular histogram edge detection still behaves correctly.
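For context, the positional order of numpy.linspace is start, stop, num, endpoint, retstep, dtype, axis, so a fourth positional argument lands in endpoint. The sketch below shows the kind of regularity check these calls feed, using only the standard library as a stand-in for numpy.linspace and numpy.allclose; the function names and tolerances are illustrative assumptions.

```python
import math


def evenly_spaced(start, stop, num):
    """Stdlib stand-in for numpy.linspace(start, stop, num), endpoint included."""
    if num == 1:
        return [float(start)]
    step = (stop - start) / (num - 1)
    return [start + i * step for i in range(num)]


def is_regular(edges, xmin, xmax, rel_tol=1e-9, abs_tol=1e-12):
    """True if edges match an evenly spaced grid from xmin to xmax."""
    expected = evenly_spaced(xmin, xmax, len(edges))
    return all(
        math.isclose(e, x, rel_tol=rel_tol, abs_tol=abs_tol)
        for e, x in zip(edges, expected)
    )


regular = is_regular([0.0, 0.5, 1.0, 1.5, 2.0], 0.0, 2.0)
irregular = is_regular([0.0, 0.5, 1.0, 3.0], 0.0, 3.0)
```

Only the number of points matters for this comparison, which is why dropping the extra argument does not change the result of the check.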

7. TGraph default Y range used X (interpret.py)
_as_TGraph computed new_minY = np.min(x) / new_maxY = np.max(x) instead of y, so written TGraphs got a wildly wrong fMinimum/fMaximum display range. Fixed to use y. Verified fMinimum/fMaximum now reflect the padded Y range.

8. Oversized/incompressible compression blocks corrupted files (compression.py)
compress() packed len_compressed into a 3-byte header field via _4byte.pack(len_compressed)[:-1], silently truncating the high byte when an incompressible block's compressed size exceeded 2**24-1 (zlib/lz4 worst case grows the data). With a mix of compressible and incompressible blocks the whole-buffer len(out) < len(data) guard didn't catch it, producing an unreadable file.
Reproducer: zeros block + random incompressible block → compress() returned something smaller than the input but a corrupt second-block header (error -5 ... incomplete or truncated stream on read).
Fix: matching ROOT's whole-buffer decision, fall back to storing data uncompressed when any block fails to shrink or would overflow the 3-byte field. Verified round-trips for ZLIB/LZ4/ZSTD on the mixed case (raw fallback) and that ordinary compressible data still compresses and reads back.
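The truncation and the fallback rule can be sketched with the standard library. This is a simplified illustration, not uproot's compress(): the block header here is only the 3-byte length (ROOT's real header carries more fields), _4byte is assumed to be a little-endian 4-byte struct, and compress_or_raw and its block_size argument are hypothetical.

```python
import os
import struct
import zlib

_4byte = struct.Struct("<I")  # assumption: little-endian, as the [:-1] slice suggests
MAX_3BYTE = 2**24 - 1


def pack_3byte(n):
    """Old behavior: pack into 4 bytes, then drop the last (high) byte."""
    return _4byte.pack(n)[:-1]


def unpack_3byte(b):
    return int.from_bytes(b, "little")


def compress_or_raw(data, block_size):
    """Return (payload, compressed). Fall back to raw data for the whole buffer
    if any block fails to shrink or would not fit the 3-byte size field."""
    blocks = []
    for start in range(0, len(data), block_size):
        block = data[start:start + block_size]
        packed = zlib.compress(block)
        if len(packed) >= len(block) or len(packed) > MAX_3BYTE:
            return data, False
        blocks.append(pack_3byte(len(packed)) + packed)
    return b"".join(blocks), True


# A size just over the 3-byte limit silently loses its high byte.
truncated = unpack_3byte(pack_3byte(2**24 + 5))

# Mixed input: a block of zeros followed by a block of random bytes.
mixed = bytes(4096) + os.urandom(4096)
mixed_payload, mixed_compressed = compress_or_raw(mixed, 4096)

# Ordinary compressible input still compresses.
zeros_payload, zeros_compressed = compress_or_raw(bytes(8192), 4096)
```

Checking each block, rather than only comparing the total output size to the input size, is what catches the mixed case: the zeros block shrinks enough to hide the growth of the random block in the total.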

Tests

Ran the existing writing suites (test_0099, test_0405, test_1128, test_0406, test_1264, test_0416, test_0414, test_0412, test_0940, test_0498, test_1604, test_0580, test_0351, test_0349, test_0352, test_0422, test_1000, test_1599, test_0014, test_0023) — all pass (ROOT-requiring tests skip locally, no ROOT installed). prek -a clean.

Skipped

Nothing from the list was skipped.

🤖 Generated with Claude Code

Part of #1646.

…, TGraph y-range, oversized compression blocks)

- _cascadetree.py: normalize offsets[0] != 0 in the string-branch path of Tree.extend by converting to ListOffsetArray64(start_at_zero=True), so a sliced string array (e.g. ak.Array([...])[2:]) produces a readable basket.
- _cascadetree.py: recarray_to_dict now iterates .items() of the nested dict instead of unpacking dict keys, fixing nested structured-array writing.
- _cascadetree.py: store TLeafC fLen per-branch in datum["fLen"] and accumulate with max across extends, instead of overwriting a shared tree-level value; compute the biggest-string size via numpy.diff().max().
- _cascadetree.py: fix duplicate-branch error message ({kk!r} -> {k!r}) and use astype(..., copy=False) for the flat-branch path.
- identify.py: pass len(edges) only to numpy.linspace for regularity checks (the dtype was being fed into the positional endpoint argument).
- interpret.py: _as_TGraph default Y range now uses min(y)/max(y), not x.
- compression.py: fall back to storing data uncompressed when a block does not shrink or its compressed size would overflow the 3-byte header field, which previously silently truncated the high byte and corrupted the block.

Assisted-by: ClaudeCode:claude-opus-4-8

Covers sliced string branches, per-branch TLeafC fLen accumulation, nested structured-array recarray_to_dict, TGraph default Y range, histogram regular edge detection, and incompressible compression blocks (unit + end-to-end).

Assisted-by: ClaudeCode:claude-opus-4-8

codecov · 2026-06-10T19:22:45Z

Codecov Report

❌ Patch coverage is 64.70588% with 6 lines in your changes missing coverage. Please review.
✅ Project coverage is 73.79%. Comparing base (1c06db0) to head (404e332).
⚠️ Report is 1 commits behind head on main.
✅ All tests successful. No failed tests found.

Files with missing lines	Patch %	Lines
src/uproot/writing/identify.py	14.28%	1 Missing and 5 partials ⚠️

Additional details and impacted files

Files with missing lines	Coverage Δ
src/uproot/compression.py	`77.34% <100.00%> (+0.17%)`	⬆️
src/uproot/writing/_cascadetree.py	`84.02% <100.00%> (+0.32%)`	⬆️
src/uproot/writing/interpret.py	`53.39% <100.00%> (ø)`
src/uproot/writing/identify.py	`73.69% <14.28%> (+0.32%)`	⬆️

... and 1 file with indirect coverage changes

henryiii added 2 commits June 10, 2026 15:00

henryiii mentioned this pull request Jun 10, 2026

Claude Fable AI review on uproot #1646

Open

henryiii wants to merge 2 commits into main from fix-ttree-writing
