fix: TTree-writing bugs (sliced string arrays, nested recarrays, fLen, TGraph y-range, oversized compression blocks) #1657
Open
henryiii wants to merge 2 commits into
…, TGraph y-range, oversized compression blocks)
- _cascadetree.py: normalize offsets[0] != 0 in the string-branch path of
Tree.extend by converting to ListOffsetArray64(start_at_zero=True), so a
sliced string array (e.g. ak.Array([...])[2:]) produces a readable basket.
- _cascadetree.py: recarray_to_dict now iterates .items() of the nested dict
instead of unpacking dict keys, fixing nested structured-array writing.
- _cascadetree.py: store TLeafC fLen per-branch in datum["fLen"] and accumulate
with max across extends, instead of overwriting a shared tree-level value;
compute the biggest-string size via numpy.diff().max().
- _cascadetree.py: fix duplicate-branch error message ({kk!r} -> {k!r}) and use
astype(..., copy=False) for the flat-branch path.
- identify.py: pass len(edges) only to numpy.linspace for regularity checks
(the dtype was being fed into the positional endpoint argument).
- interpret.py: _as_TGraph default Y range now uses min(y)/max(y), not x.
- compression.py: fall back to storing data uncompressed when a block does not
shrink or its compressed size would overflow the 3-byte header field, which
previously silently truncated the high byte and corrupted the block.
Assisted-by: ClaudeCode:claude-opus-4-8
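The header overflow described in the compression.py bullet can be illustrated with a small standard-library sketch. It assumes the 3-byte size field comes from packing a little-endian 32-bit integer and dropping the last byte, as the `_4byte.pack(len_compressed)[:-1]` expression suggests. The helpers `pack_3byte` and `compress_or_raw`, and the simple per-block loop, are hypothetical stand-ins and not uproot's implementation.

```python
import random
import struct
import zlib

_4byte = struct.Struct("<I")  # assumption: little-endian unsigned 32-bit
MAX_3BYTE = 2**24 - 1         # largest value a 3-byte field can hold


def pack_3byte(n):
    """Pack n into 3 bytes by dropping the high byte (lossy if n > MAX_3BYTE)."""
    return _4byte.pack(n)[:-1]


def compress_or_raw(data, block_size, max_field=MAX_3BYTE):
    """Hypothetical sketch: return compressed blocks, or None meaning "store raw".

    Gives up as soon as any block fails to shrink or its compressed size
    would not fit in the header field.
    """
    blocks = []
    for start in range(0, len(data), block_size):
        block = data[start:start + block_size]
        compressed = zlib.compress(block)
        if len(compressed) >= len(block) or len(compressed) > max_field:
            return None  # caller would write `data` uncompressed
        blocks.append(compressed)
    return blocks


# Truncation: a size just past 2**24 packs to the same 3 bytes as a tiny size.
same_bytes = pack_3byte(2**24 + 5) == pack_3byte(5)

# Mixed input: one compressible block followed by one incompressible block.
rng = random.Random(0)
zeros = bytes(4096)
noise = bytes(rng.getrandbits(8) for _ in range(4096))
mixed_result = compress_or_raw(zeros + noise, block_size=4096)
zeros_result = compress_or_raw(zeros + zeros, block_size=4096)
```

In this sketch the mixed input falls back to raw storage even though the zeros block alone compresses well, which is the per-block check a whole-buffer size comparison would miss.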
Covers sliced string branches, per-branch TLeafC fLen accumulation, nested structured-array recarray_to_dict, TGraph default Y range, histogram regular edge detection, and incompressible compression blocks (unit + end-to-end). Assisted-by: ClaudeCode:claude-opus-4-8
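The sliced-string case comes down to offsets that no longer start at zero. Here is a minimal sketch of the normalization, using plain Python lists and bytes as stand-ins for the Awkward layout. `normalize_offsets` is a hypothetical helper that only mirrors the effect the first commit message attributes to the `ListOffsetArray64` conversion.

```python
def normalize_offsets(offsets, content):
    """Shift offsets to start at zero and trim content to the referenced range."""
    start, stop = offsets[0], offsets[-1]
    return [o - start for o in offsets], content[start:stop]


# Four strings stored back to back: "one", "two", "three", "four".
content = b"onetwothreefour"
offsets = [0, 3, 6, 11, 15]

# Slicing off the first two strings keeps the same content buffer but
# leaves offsets that start at 6 instead of 0.
sliced_offsets = offsets[2:]

new_offsets, new_content = normalize_offsets(sliced_offsets, content)
strings = [new_content[a:b] for a, b in zip(new_offsets, new_offsets[1:])]
```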
🤖 AI text below 🤖
This fixes a batch of verified bugs in the TTree-writing path (plus a couple in histogram/TGraph writing and compression). Each was reproduced with a write-then-read round-trip before fixing.
What was broken & fixed
1. Sliced string branches produced unreadable baskets (`_cascadetree.py`)
The `>U0` string path of `Tree.extend` built `big_endian_offsets` from `layout.offsets` directly without normalizing `offsets[0] != 0` (unlike the jagged path). Writing a sliced string array via `mktree`/`extend` produced a basket whose offsets section was malformed.
Reproducer (raised `ValueError: When changing to a larger dtype...` on read).
Fix: convert `ListArray`/`RegularArray`/`ListOffsetArray` with `to_ListOffsetArray64(True)` so offsets start at zero and content is trimmed.

2. Nested structured arrays could not be written (`recarray_to_dict`)
`for subfield_name, subfield in recarray_to_dict(field):` iterated dict keys and tried to unpack each string → `ValueError: not enough values to unpack`. Changed to `.items()`. Verified with a nested `np.dtype([('a', i4), ('nested', [('x', f8), ('y', i2)])])`.

3. TLeafC `fLen` was shared across branches and overwritten per-extend (`_cascadetree.py`)
`fLen` (max string size) was stored in tree-level `self._metadata["fLen"]` and written into every `>U0` branch, and overwritten (not max-accumulated) on each extend. Two string branches with max lengths 24 and 3 both ended up with the last basket's value. Now stored per-branch in `datum["fLen"]` and accumulated with `max` across extends; biggest-string computed via `numpy.diff(offsets).max()`. Verified two branches now report independent, correct `fLen` (25 and 41) accumulated across extends.

4. Wrong variable in duplicate-branch error message: `{kk!r}` → `{k!r}` (`kk` was only bound in the jagged sub-branch above).

5. Avoid an unnecessary copy: flat-branch path uses `astype(datum["dtype"], copy=False)`.

6. `numpy.linspace` dtype fed into `endpoint` (`identify.py`, 7 sites)
`numpy.linspace(fXmin, fXmax, len(edges), edges.dtype)` passed the dtype object into the positional `endpoint` parameter (truthy, so it worked by accident, but the dtype was ignored). These only feed `numpy.allclose` regularity checks, so the argument was dropped. Verified regular/irregular histogram edge detection still behaves correctly.

7. TGraph default Y range used X (`interpret.py`)
`_as_TGraph` computed `new_minY = np.min(x)` / `new_maxY = np.max(x)` instead of `y`, so written TGraphs got a wildly wrong `fMinimum`/`fMaximum` display range. Fixed to use `y`. Verified `fMinimum`/`fMaximum` now reflect the padded Y range.

8. Oversized/incompressible compression blocks corrupted files (`compression.py`)
`compress()` packed `len_compressed` into a 3-byte header field via `_4byte.pack(len_compressed)[:-1]`, silently truncating the high byte when an incompressible block's compressed size exceeded `2**24-1` (zlib/lz4 worst case grows the data). With a mix of compressible and incompressible blocks the whole-buffer `len(out) < len(data)` guard didn't catch it, producing an unreadable file.
Reproducer: zeros block + random incompressible block → `compress()` returned something smaller than the input but a corrupt second-block header (`error -5 ... incomplete or truncated stream` on read).
Fix: matching ROOT's whole-buffer decision, fall back to storing `data` uncompressed when any block fails to shrink or would overflow the 3-byte field. Verified round-trips for ZLIB/LZ4/ZSTD on the mixed case (raw fallback) and that ordinary compressible data still compresses and reads back.

Tests
Ran the existing writing suites (`test_0099`, `test_0405`, `test_1128`, `test_0406`, `test_1264`, `test_0416`, `test_0414`, `test_0412`, `test_0940`, `test_0498`, `test_1604`, `test_0580`, `test_0351`, `test_0349`, `test_0352`, `test_0422`, `test_1000`, `test_1599`, `test_0014`, `test_0023`); all pass (ROOT-requiring tests skip locally, no ROOT installed). `prek -a` clean.

Skipped
Nothing from the list was skipped.
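For reviewers, the `recarray_to_dict` failure in item 2 is the usual "iterate a dict, get only keys" pitfall. The sketch below reproduces it with plain nested dicts instead of NumPy structured arrays. The function names and the dotted `nested.x` key format are assumptions for illustration and do not claim to match uproot's naming.

```python
def flatten_buggy(nested):
    out = {}
    for name, value in nested.items():
        if isinstance(value, dict):
            # Bug pattern: iterating a dict yields keys only, so each key
            # string is unpacked character by character.
            for sub_name, sub_value in flatten_buggy(value):
                out[f"{name}.{sub_name}"] = sub_value
        else:
            out[name] = value
    return out


def flatten_fixed(nested):
    out = {}
    for name, value in nested.items():
        if isinstance(value, dict):
            # Fix: iterate (key, value) pairs of the nested result.
            for sub_name, sub_value in flatten_fixed(value).items():
                out[f"{name}.{sub_name}"] = sub_value
        else:
            out[name] = value
    return out


record = {"a": [1, 2], "nested": {"x": [1.5, 2.5], "y": [3, 4]}}

try:
    flatten_buggy(record)
    buggy_error = None
except ValueError as err:
    buggy_error = str(err)

flat = flatten_fixed(record)
```

Note that with two-character field names the buggy loop would not raise at all; it would silently split each name into two characters, which is arguably worse.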
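The per-branch accumulation from item 3 can be sketched without NumPy. `BranchSketch` is a hypothetical stand-in for one string branch's metadata, and `longest_string` plays the role that the description assigns to `numpy.diff(offsets).max()`. The value actually written to TLeafC may include extra bytes beyond the raw maximum, and this sketch does not model that.

```python
def longest_string(offsets):
    """Largest gap between consecutive offsets, i.e. the longest string in bytes."""
    return max(b - a for a, b in zip(offsets, offsets[1:]))


class BranchSketch:
    """Hypothetical stand-in for one string branch's metadata."""

    def __init__(self):
        self.datum = {"fLen": 0}

    def extend(self, offsets):
        # Accumulate with max so a later, shorter basket cannot shrink the value.
        self.datum["fLen"] = max(self.datum["fLen"], longest_string(offsets))


names, codes = BranchSketch(), BranchSketch()

names.extend([0, 24, 30])   # longest string: 24 bytes
codes.extend([0, 3, 5])     # longest string: 3 bytes
names.extend([0, 2, 4])     # shorter second basket must not overwrite 24
codes.extend([0, 1, 2])

result = (names.datum["fLen"], codes.datum["fLen"])
```

Keeping the value on each branch rather than on the tree is what stops the two branches from clobbering each other.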
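For item 7, a short sketch shows why taking the default Y range from `x` is so visible in practice. The `default_range` helper and its 10% padding are assumptions made for illustration. The description only says the range is padded, not by how much.

```python
def default_range(values, pad_fraction=0.1):
    """Return (low, high) around values, widened by pad_fraction of the span.

    pad_fraction is an assumed value for illustration only.
    """
    low, high = min(values), max(values)
    pad = pad_fraction * (high - low)
    return low - pad, high + pad


x = [0.0, 1.0, 2.0, 3.0]
y = [100.0, 150.0, 120.0, 200.0]

wrong_min_y, wrong_max_y = default_range(x)   # bug pattern: range taken from x
min_y, max_y = default_range(y)               # fixed: range taken from y
```

With these inputs the range derived from `x` ends below the smallest `y` value, so none of the points would fall inside the default display range.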
🤖 Generated with Claude Code
Part of #1646.