fix: TTree-writing bugs (sliced string arrays, nested recarrays, fLen, TGraph y-range, oversized compression blocks)#1657

Open: henryiii wants to merge 2 commits into main from fix-ttree-writing

@henryiii henryiii commented Jun 10, 2026

Member

🤖 AI text below 🤖

This fixes a batch of verified bugs in the TTree-writing path (plus a couple in histogram/TGraph writing and compression). Each was reproduced with a write-then-read round-trip before fixing.

What was broken & fixed

1. Sliced string branches produced unreadable baskets (_cascadetree.py)
The >U0 string path of Tree.extend built big_endian_offsets directly from layout.offsets without handling the case offsets[0] != 0 (the jagged path does normalize this). Writing a sliced string array via mktree/extend therefore produced a basket with a malformed offsets section.
Reproducer (raised ValueError: When changing to a larger dtype... on read):

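import awkward as ak
import uproot

fn = "sliced_strings.root"  # hypothetical output path, any writable location works
arr = ak.Array(["aaa", "bb", "cccc", "dd", "eeeee"])[2:]  # slicing leaves offsets[0] != 0
with uproot.recreate(fn) as f:
    f.mktree("t", {"s": "string"}).extend({"s": arr})
uproot.open(fn)["t/s"].array()  # failed before this fix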

Fix: convert ListArray/RegularArray/ListOffsetArray with to_ListOffsetArray64(True) so offsets start at zero and content is trimmed.
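
To illustrate what "offsets start at zero and content is trimmed" means, here is a minimal NumPy sketch of the normalization. It is not uproot's or Awkward's code; normalize_offsets is a hypothetical helper, and the byte string stands in for the concatenated string content of the reproducer above.

```python
import numpy as np


def normalize_offsets(offsets, content):
    """Hypothetical helper: shift offsets so offsets[0] == 0 and trim content to match."""
    offsets = np.asarray(offsets, dtype=np.int64)
    start, stop = int(offsets[0]), int(offsets[-1])
    return offsets - start, content[start:stop]


# Concatenated bytes of ["aaa", "bb", "cccc", "dd", "eeeee"] and their offsets.
content = b"aaabbccccddeeeee"
full_offsets = np.array([0, 3, 5, 9, 11, 16])

# Slicing the array with [2:] keeps the same content buffer but drops leading offsets,
# so the first remaining offset is 5 rather than 0.
sliced_offsets = full_offsets[2:]

norm_offsets, norm_content = normalize_offsets(sliced_offsets, content)
print(norm_offsets.tolist(), norm_content)
```

A writer that assumes the first offset is zero would misplace every string in the basket when handed sliced_offsets directly, which is the failure mode described above.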

2. Nested structured arrays could not be written (recarray_to_dict)
The loop for subfield_name, subfield in recarray_to_dict(field): iterated over the returned dict's keys and tried to unpack each key string into two names → ValueError: not enough values to unpack. Changed to iterate over .items(). Verified with a nested np.dtype([('a', 'i4'), ('nested', [('x', 'f8'), ('y', 'i2')])]).
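
The pattern can be sketched as follows. This is an illustrative re-implementation, not the function in _cascadetree.py: flatten_recarray is a hypothetical name, and the dotted naming of nested fields is an assumption made for the example.

```python
import numpy as np


def flatten_recarray(array):
    """Hypothetical sketch: flatten a (possibly nested) structured array into a dict."""
    out = {}
    for name in array.dtype.names:
        field = array[name]
        if field.dtype.names is not None:
            # The recursive call returns a dict, so iterate .items() to get
            # (key, value) pairs. Iterating the dict itself yields only keys.
            for sub_name, sub in flatten_recarray(field).items():
                out[f"{name}.{sub_name}"] = sub
        else:
            out[name] = field
    return out


dtype = np.dtype([("a", "i4"), ("nested", [("x", "f8"), ("y", "i2")])])
data = np.zeros(3, dtype=dtype)
flat = flatten_recarray(data)
print(list(flat))
```

Note that unpacking a key string only raises when its length is not exactly two, so a two-character field name would have been split into two characters silently instead of failing.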

3. TLeafC fLen was shared across branches and overwritten per-extend (_cascadetree.py)
fLen (max string size) was stored in tree-level self._metadata["fLen"] and written into every >U0 branch, and overwritten (not max-accumulated) on each extend. Two string branches with max lengths 24 and 3 both ended up with the last basket's value. Now stored per-branch in datum["fLen"] and accumulated with max across extends; biggest-string computed via numpy.diff(offsets).max(). Verified two branches now report independent, correct fLen (25 and 41) accumulated across extends.
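
A minimal sketch of per-branch max accumulation, assuming each branch carries its own dict (called datum here, as in the description) and each extend supplies zero-based offsets. update_flen is a hypothetical helper; the value uproot actually stores may include extra bytes beyond the raw longest string, which this sketch does not model.

```python
import numpy as np


def update_flen(datum, offsets):
    """Hypothetical helper: keep the running maximum string length for one branch."""
    lengths = np.diff(np.asarray(offsets))
    biggest = int(lengths.max()) if len(lengths) > 0 else 0
    # Accumulate with max so a later basket of short strings cannot shrink fLen.
    datum["fLen"] = max(datum.get("fLen", 0), biggest)
    return datum["fLen"]


branch_a = {}  # per-branch state, independent of other branches
branch_b = {}

update_flen(branch_a, [0, 3, 5])  # strings of length 3 and 2
update_flen(branch_a, [0, 24])    # one string of length 24
update_flen(branch_a, [0, 2])     # a short string must not overwrite 24
update_flen(branch_b, [0, 3])     # the other branch only ever sees length 3

print(branch_a["fLen"], branch_b["fLen"])
```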

4. Wrong variable in duplicate-branch error message: {kk!r} → {k!r} (kk was only bound in the jagged sub-branch above).

5. Avoid an unnecessary copy — flat-branch path uses astype(datum["dtype"], copy=False).

6. numpy.linspace dtype fed into endpoint (identify.py, 7 sites)
numpy.linspace(fXmin, fXmax, len(edges), edges.dtype) passed the dtype object into the positional endpoint parameter (truthy, so it worked by accident, but the dtype was ignored). These only feed numpy.allclose regularity checks, so the argument was dropped. Verified regular/irregular histogram edge detection still behaves correctly.
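
For reference, the positional order of numpy.linspace is start, stop, num, endpoint, so a dtype can only be supplied by keyword. A regularity check of the kind described might look like the sketch below; is_regular is a hypothetical helper and not the code in identify.py, which uses fXmin and fXmax rather than the first and last edge.

```python
import numpy as np


def is_regular(edges):
    """Hypothetical check: are the bin edges evenly spaced?"""
    edges = np.asarray(edges, dtype=np.float64)
    # Pass only start, stop, and num. If a dtype were ever needed,
    # it would have to be given as dtype=..., never as a 4th positional argument.
    expected = np.linspace(edges[0], edges[-1], num=len(edges))
    return bool(np.allclose(edges, expected))


print(is_regular([0.0, 1.0, 2.0, 3.0]))
print(is_regular([0.0, 1.0, 2.0, 4.0]))
```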

7. TGraph default Y range used X (interpret.py)
_as_TGraph computed new_minY = np.min(x) / new_maxY = np.max(x) instead of y, so written TGraphs got a wildly wrong fMinimum/fMaximum display range. Fixed to use y. Verified fMinimum/fMaximum now reflect the padded Y range.

8. Oversized/incompressible compression blocks corrupted files (compression.py)
compress() packed len_compressed into a 3-byte header field via _4byte.pack(len_compressed)[:-1], silently truncating the high byte when an incompressible block's compressed size exceeded 2**24-1 (zlib/lz4 worst case grows the data). With a mix of compressible and incompressible blocks the whole-buffer len(out) < len(data) guard didn't catch it, producing an unreadable file.
Reproducer: zeros block + random incompressible block → compress() returned something smaller than the input but a corrupt second-block header (error -5 ... incomplete or truncated stream on read).
Fix: matching ROOT's whole-buffer decision, fall back to storing data uncompressed when any block fails to shrink or would overflow the 3-byte field. Verified round-trips for ZLIB/LZ4/ZSTD on the mixed case (raw fallback) and that ordinary compressible data still compresses and reads back.

Tests

Ran the existing writing suites (test_0099, test_0405, test_1128, test_0406, test_1264, test_0416, test_0414, test_0412, test_0940, test_0498, test_1604, test_0580, test_0351, test_0349, test_0352, test_0422, test_1000, test_1599, test_0014, test_0023) — all pass (ROOT-requiring tests skip locally, no ROOT installed). prek -a clean.

Skipped

Nothing from the list was skipped.

🤖 Generated with Claude Code

Part of #1646.

henryiii added 2 commits June 10, 2026 15:00
…, TGraph y-range, oversized compression blocks)

- _cascadetree.py: normalize offsets[0] != 0 in the string-branch path of
  Tree.extend by converting to ListOffsetArray64(start_at_zero=True), so a
  sliced string array (e.g. ak.Array([...])[2:]) produces a readable basket.
- _cascadetree.py: recarray_to_dict now iterates .items() of the nested dict
  instead of unpacking dict keys, fixing nested structured-array writing.
- _cascadetree.py: store TLeafC fLen per-branch in datum["fLen"] and accumulate
  with max across extends, instead of overwriting a shared tree-level value;
  compute the biggest-string size via numpy.diff().max().
- _cascadetree.py: fix duplicate-branch error message ({kk!r} -> {k!r}) and use
  astype(..., copy=False) for the flat-branch path.
- identify.py: pass len(edges) only to numpy.linspace for regularity checks
  (the dtype was being fed into the positional endpoint argument).
- interpret.py: _as_TGraph default Y range now uses min(y)/max(y), not x.
- compression.py: fall back to storing data uncompressed when a block does not
  shrink or its compressed size would overflow the 3-byte header field, which
  previously silently truncated the high byte and corrupted the block.

Assisted-by: ClaudeCode:claude-opus-4-8
Covers sliced string branches, per-branch TLeafC fLen accumulation, nested
structured-array recarray_to_dict, TGraph default Y range, histogram regular
edge detection, and incompressible compression blocks (unit + end-to-end).

Assisted-by: ClaudeCode:claude-opus-4-8
codecov Bot commented Jun 10, 2026

Codecov Report

❌ Patch coverage is 64.70588% with 6 lines in your changes missing coverage. Please review.
✅ Project coverage is 73.79%. Comparing base (1c06db0) to head (404e332).
⚠️ Report is 1 commit behind head on main.
✅ All tests successful. No failed tests found.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| src/uproot/writing/identify.py | 14.28% | 1 Missing and 5 partials ⚠️ |

Additional details and impacted files

| Files with missing lines | Coverage Δ |
|---|---|
| src/uproot/compression.py | 77.34% <100.00%> (+0.17%) ⬆️ |
| src/uproot/writing/_cascadetree.py | 84.02% <100.00%> (+0.32%) ⬆️ |
| src/uproot/writing/interpret.py | 53.39% <100.00%> (ø) |
| src/uproot/writing/identify.py | 73.69% <14.28%> (+0.32%) ⬆️ |

... and 1 file with indirect coverage changes
