feat(transaction): with_snapshot_id() across actions, MergeAppend auto-compaction by srinath-prabhu · Pull Request #8 · e6data/iceberg-rust

srinath-prabhu · 2026-06-04T09:05:03Z

Summary

MergeAppend — auto-compact manifests at commit time (6da0719)
FastAppendAction.with_snapshot_id() + public generate_unique_snapshot_id (bd7461c)
Mirror with_snapshot_id() on ReplaceDataFilesAction (fc85a11)
Mirror with_snapshot_id() on RewriteManifestsAction (6d3124e)

⚠️ Important: reverted commits on main must be restored at merge time

Two fixes were originally on main but then reverted there (they remain active in this branch's history):

b63983e fix(transaction): allow multiple StatisticsFile entries per snapshot — reverted by cc16218
c22c418 fix: prevent u64 underflow in snapshot summary total-data-files — reverted by ff4c1c1

Because the original commits are already in main's history (just reverted), merging this PR will NOT re-apply those two changes automatically — git treats them as already-merged-then-reverted. When merging this PR, you must also revert the reverts on main:

git revert ff4c1c1 cc16218

(or cherry-pick the original changes back). Do not skip this step, or the StatisticsFile and u64-underflow fixes will silently stay missing from main.

🤖 Generated with Claude Code

Implements manifest merging matching Java SDK's MergingSnapshotProducer: - After each commit, checks manifest count vs min-count-to-merge (default 100) - Groups small manifests (<target-size-bytes, default 8MB) by partition spec - Bin-packs and merges them into target-sized manifests - Entries change status Added → Existing in merged manifests - Replaces both try_merge_into_existing (crude) and Tessellate manifest rewriting (external) Controlled by existing table properties: commit.manifest-merge.enabled = true commit.manifest.min-count-to-merge = 100 (default) commit.manifest.target-size-bytes = 8388608 (8MB default) For our scenario (6 manifests/commit, 30s interval): Merge triggers every ~16 commits (8 min) Merges ~100 small manifests into ~1 target-sized manifest Amortized cost: ~9ms/commit (one S3 batch read + one write) Manifest count: bounded at ~100 instead of growing unbounded

…ate_unique_snapshot_id Lets callers pre-allocate the new snapshot's id before commit so other parts of the same transaction can reference it. Concrete need: when a `FastAppendAction` runs together with `update_statistics().set_statistics(...)` in one transaction, the `StatisticsFile` entries key the per-snapshot map on `snapshot_id`. Without this API the caller has no way to know the action's snapshot_id pre-commit, so every entry gets registered under `snapshot_id=0` and `metadata.statistics_for_snapshot(current)` returns `None`. Pattern: use iceberg::transaction::generate_unique_snapshot_id; let snapshot_id = generate_unique_snapshot_id(&table); // ... attach snapshot_id to each StatisticsFile ... let action = tx.fast_append() .with_snapshot_id(snapshot_id) .add_data_files(files); Mirrors the same pre-allocation pattern RewriteManifestsAction already uses via the internal `generate_unique_snapshot_id_static`; this just exposes it through a clean public API + plumbs the override through FastAppendAction → SnapshotProducer. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Same shape as the FastAppendAction builder from 4a9a5b1. Lets a compaction commit (laminar's merge-on-write) pre-allocate the new snapshot's id so the same transaction can register carry-forward `StatisticsFile` entries under that snapshot — closing the per-snapshot Puffin stats gap where compaction snapshots become the catalog's `current-snapshot-id` without any stats entry, which would otherwise force every reader to walk the parent chain or fall back to a SQL scan. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Same shape as the FastAppendAction (4a9a5b1) and ReplaceDataFilesAction (421ce93) builders. Lets laminar's manifest-compaction maintenance loop pre-allocate the new snapshot's id so the same caller can register a carry-forward StatisticsFile entry under that snapshot — closing the per-snapshot Puffin stats gap on the third commit path. Without this, every manifest compaction (runs every 600s when manifest count > 100) produces a new current-snapshot-id with no statistics-files entry. The executor's `metadata.statistics_for_snapshot(current)` returns None, the label_values fast path falls back to a parquet scan, and stays in the fallback until the chain rolls forward to a FastAppend that re-attaches stats. Both commit paths inside this file (the two-phase `execute()` for concurrent-write resilience, and the simpler `TransactionAction::commit()` fallback) consume the override via `unwrap_or_else`, so the random id generation is preserved when no override is set. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Implements RootManifest types and Parquet serialization for the v4 root manifest that replaces the manifest list layer. Supports three entry types in one Parquet file: manifest references, inline data files, and inline delete files. Includes MDV bitmap column for manifest delete vectors and a to_manifest_list() shim for backward compatibility with existing scan code. 7 tests covering round-trip, projection, MDV bitmap, and shim.

Adds V4 = 4 to FormatVersion enum. All existing match arms treat V4 identically to V3 for now — the V4-specific root manifest commit path will be added in a subsequent commit. This ensures V4 tables can be created and operate through the existing V3 code paths as a baseline.

SnapshotProducer.commit() now dispatches to commit_v4() for V4 tables. The V4 path writes a single root manifest Parquet file containing: - Inline data/delete file entries for new files - Carried-forward manifest references from the previous snapshot - Handles V3→V4 upgrade by converting manifest list entries to refs This reduces commit overhead from 3+ S3 PUTs (manifest + manifest list + metadata.json) to 1 PUT (root manifest) + 1 CAS (metadata pointer). 1076 tests passing, backward compatible with V1-V3 tables.

Adds ManifestDeleteVector struct using roaring bitmaps for soft-deleting entries in child manifests without rewriting them. V4 commit path now handles removed_data_files by: - Removing matching inline entries directly - Building MDV bitmaps for affected child manifest references - Merging with existing MDVs on subsequent compactions This eliminates manifest rewrite amplification during compaction — instead of rewriting N manifest files, only the root manifest is updated with MDV bitmaps marking deleted row indices.

Adds RebalanceRootManifestAction that flushes accumulated inline entries into child manifest files and compacts MDV-heavy manifests: - Inline entries grouped by content type and written as Parquet manifests - Manifest refs with MDV deleted fraction > threshold are rewritten without deleted entries, clearing the MDV - Configurable thresholds: inline_threshold (1000), mdv_compaction (0.3) - No-op when rebalance not needed (inline count below threshold) Available via Transaction::rebalance_root_manifest(). 5 unit tests.

Adds 2 new tests: - mdv_comprehensive: empty, mark, is_deleted, serialize/deserialize round-trip, merge, deleted_fraction, idempotent insert - remove_inline_files: verifies inline entries are removed by path while manifest refs are preserved Total v4-related tests: 14 (7 root manifest + 5 rebalance + 2 new)

Change 1: Cache-forward root manifest entries. commit_v4() accepts pre-cached entries via with_cached_root_entries(), skipping S3 read. ActionCommit returns final entries for the caller to cache. First commit reads from S3, every subsequent commit is zero S3 reads. Change 5: File-to-manifest index for MDV. SnapshotProducer accepts an optional HashMap<file_path, manifest_path> so the MDV block only scans manifests that contain removed files (typically 1-3) instead of all manifests (potentially hundreds). Change 6: Summary updated after MDV. Tracks total_mdv_deleted_files and total_mdv_deleted_rows during the MDV block and inserts them into the snapshot summary so total-data-files reflects logical deletions.

Change 2: Root manifest now uses two row groups instead of a 38-column discriminator schema. Row group 0 = manifest refs (17 cols), row group 1 = inline entries (21 cols). ~45% smaller files, no null column overhead, reader can skip irrelevant row group. Change 3: Adaptive inline→child flush inside commit_v4(). When inline count exceeds root-manifest.inline-threshold (default 500), inline entries are flushed to a child Parquet manifest and replaced with a single manifest ref — all within the same commit. Zero separate table maintenance needed for append-only streaming workloads.

Change 4: to_manifest_list() now computes FieldSummary (min/max partition bounds) from inline data entries using PartitionFieldStats. This enables the query planner to prune inline entries by partition value — for observability tables partitioned by (tenant, hour), a query for the last hour skips entries from other hours. Made PartitionFieldStats pub(super) so root_manifest.rs can reuse the same partition bounds logic that ManifestWriter uses.

…fallback

- Merge small manifest refs after adaptive flush to keep ref count bounded - Fix cache validation: discard cache on first commit (no current snapshot) - Add row lineage (first_row_id + added_rows) to V4 snapshots - Add V4 fast_append integration test (1 passing, 1 ignored pending FileIO fix)

- Native V4 scan path: inline entries injected directly into scan pipeline as ManifestEntryContext, bypassing ManifestFile::load_manifest - V4-aware duplicate validation checks both manifest refs and inlines - MDV handling on read path: bitmaps threaded through ManifestFileContext, applied during manifest entry streaming to skip soft-deleted rows - Cache validation uses explicit (snapshot_id, entries) tuple instead of deriving from inline entry inspection (TOCTOU fix) - added_rows counts only current-snapshot entries, not carried-forward - RebalanceRootManifestAction sets row_range to prevent serialization crash - read_root_manifest accepts Bytes (zero-copy via Arc clone) - manifest_entries_to_record_batch generic over Borrow<ManifestEntry>, eliminating deep clone of all inline entries on every commit - Object cache weigher uses estimated_size() accounting for heap allocations - Default inline threshold lowered from 500 to 100 - TableMetadataV4 struct with VersionNumber<4> for correct serialization - Removed dead to_manifest_list() shim and unused rebalance variables

…riendly) Adds `Table::effective_format_version()` and a custom table property (`e6.actual-format-version`) that lets a table declare V2/V3 to the catalog wire format while internally taking the V4 commit + rebalance paths. Motivation: catalogs that pre-date V4 (e.g. Lakekeeper at commit bb70173) reject `format-version=4` at CREATE TABLE time: crates/lakekeeper/src/server/tables/create_table.rs:440-444 match v.as_str() { "v1"|"1" => V1, "v2"|"2" => V2, "v3"|"3" => V3, _ => Err("InvalidFormatVersion") } But everything below that gate is opaque to such catalogs: - manifest_list is stored as a plain `String` path -- Lakekeeper never opens the file (grep across its source: zero references to `manifest_list`, `load_manifest`, or `ManifestList`). - Snapshot expiration / drop_table only touch metadata.json; no manifest file walks. - No `deny_unknown_fields` anywhere in lakekeeper's iceberg-rust fork -- unknown table properties round-trip cleanly. - SnapshotV3 already accepts every field our V4 commit produces (manifest_list, summary, schema_id, row_range), so V4 snapshots serialise as valid V3 JSON. So V4's "wire format" against Lakekeeper is identical to V3 plus a single opt-in property; the actual V4 mechanics (Parquet root manifest, MDV bitmaps, single-file commits) live in object storage where the catalog never reaches. The change is intentionally narrow: only the two BEHAVIOUR dispatch sites switch over to `effective_format_version()`: - `SnapshotProducer::commit()` (snapshot.rs:914) -- the "if V4 use commit_v4 else use the manifest-list path" branch. - `RebalanceRootManifestAction::commit()` (rebalance_root_manifest.rs:149) -- the feature gate. All other `metadata().format_version()` call sites stay as-is because they decide MANIFEST-FILE shape (V3 manifest entries are correct for both V3 and V4 tables -- the existing `V3 | V4 => build_v3_data()` match arms remain accurate). Properties: - `e6.actual-format-version` -- only value `"4"` is honoured. Any other value (including `"3"`, `"v4"`, `" 4"`, `""`) falls back to the declared version; never silently downgrades. - When the declared version is already V4, the property is redundant -- legacy tests / file-system catalog use cases continue to work unchanged. Constraints callers must respect: - Never emit `TableUpdate::UpgradeFormatVersion { V4 }` against a V3-declaring catalog; Lakekeeper's enum stops at V3 and it would be rejected. - The opt-in is invisible to non-e6 readers (Trino, Spark via upstream iceberg lib). Those will see `format-version=3`, try to read `manifest_list` as Avro, and fail. Tables using this property MUST only be served by readers that honour `effective_format_version()`. 4 new table-level unit tests cover the precedence rules + bogus value fallback. Full lib suite: 1092 passed, 0 failed. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

The previous opt-in patch (8f6a35f) wired write-side dispatch through `Table::effective_format_version()` but left the read-side `Snapshot::load_manifest_list` still gating on the catalog-declared `TableMetadata::format_version()`. The net result: a table that declares V3 to a strict catalog (e.g. Lakekeeper pre-V4) and carries `e6.actual-format-version=4` would WRITE a Parquet root manifest on commit (V4 path) and then FAIL on the next scan trying to parse it as Avro (V3 path). Refactor: - Move the precedence rule onto `TableMetadata::effective_format_version` -- the single source of truth. `Table::effective_format_version` becomes a thin delegate. - Move the V4 opt-in constants (`E6_ACTUAL_FORMAT_VERSION_KEY`, `E6_ACTUAL_FORMAT_VERSION_V4_VALUE`) next to the impl in `crate::spec::table_metadata`. - Re-export `E6_ACTUAL_FORMAT_VERSION_KEY` from `crate::table` for the historical import path. - Update `Snapshot::load_manifest_list` to dispatch on `effective_format_version()` -- this is the actual bug fix. Why on TableMetadata vs Table: read paths that hold only a `&TableMetadata` (no `&Table`), like `Snapshot::load_manifest_list`, need to query the effective version without going through the Table wrapper. Putting the impl on TableMetadata makes it accessible to every dispatch site that already holds metadata, which is the natural granularity for format-version decisions. The 4 table-level tests still pass unchanged (they assert the same precedence rule via the Table delegate). Full lib suite: 1092 passed, 0 failed. Existing V4 commit + rebalance unit tests untouched. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

…rquet path The V4 inline-overflow code path in commit_v4 templated the child manifest path with `.parquet` extension but invoked the Avro writer (`write_manifest_file()`), producing files named `.parquet` that contain Avro magic bytes (`Obj\x01`). Any downstream reader dispatched by file extension -- iceberg-rust's own Parquet manifest reader (read_parquet_manifest), tessellate's live-set scan, the executor's scan path, the V4 root-manifest load path -- opens them as Parquet and fails with: DataInvalid => Failed to open parquet manifest: Parquet error: Invalid Parquet file. Corrupt footer Reproduced live in the sri-olly stack: 019e9909-5e08-...-m0.parquet (139 KiB) head bytes: 4f62 6a01 → "Obj\x01" (Avro magic, not "PAR1") `RebalanceRootManifestAction` already takes the Parquet path (`write_manifest_file_parquet()`) for both Phase A and Phase B child-manifest writes. The fix is to line commit_v4 up with that convention -- V4 child manifests are unconditionally Parquet, same as V4 root manifests. The .parquet path stays correct. Both data-entry flush (snapshot.rs:1256) and delete-entry flush (:1283) sites updated. 1092 lib tests pass unchanged. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

…e-entry path TableScan::plan_files spawned two equivalent consumer tasks for the manifest-entry channels (delete and data), but the delete one was written as `spawn(...).await` while the data one was fire-and-forget `spawn(...)`. Same shape downstream, opposite ownership upstream. With V4 root manifests carrying inline entries (no child manifests), plan_files spawns a third task at lines ~381 that sends ALL inline deletes then ALL inline data on tx clones it OWNS for the duration of its body. With bounded channels sized to `concurrency_limit_manifest_files` (executor uses 4), the inline spawn's `inline_data_tx.send().await` blocks once the data channel fills past 4 entries -- it's waiting for a consumer. The data-process spawn would be that consumer, but plan_files is still stuck on the .await on the delete-process spawn at the line under change. The delete-process spawn is waiting for the delete channel to close. The delete channel closes only when inline_delete_tx drops. inline_delete_tx is held by the inline spawn, which is blocked waiting for the data consumer. Four-way deadlock; plan_files hangs forever with no progress and no error. Live-confirmed on sri-olly's observability.logs table immediately after a clean reset: V4 root manifest: 68 inline file entries, 0 child manifests scan.with_concurrency_limit(4) -> plan_files start log fires -> 10+ minutes elapse, no progress, no error -> executor's discover_files: manifest list stats reports manifest_count=0 added_files_total=0 (entries() correctly enumerates only child manifests; inline are a separate set) Fix: drop the synchronous .await on the delete-process spawn. Both consumers now run concurrently, matching the data-process spawn's existing pattern. The inline spawn can drain its data channel, finish its sends, drop both tx handles, and the delete channel closes naturally -- terminating the delete consumer on its own without blocking plan_files. The .await was likely a copy-paste leftover from an earlier sequential refactor -- the comment block on the surrounding spawn already says "in parallel" and the data sibling is correct. Adds a multi-paragraph code comment to call out the deadlock so future edits don't reintroduce it. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

…deadlock past 32 entries Sibling deadlock to e84ed18 (which removed the outer .await on the delete-process spawn). Same root pattern: `inline_delete_tx` is held across the inline-data send loop and prevents the delete channel from closing, which prevents the DeleteFileIndex from transitioning to Populated, which leaves every `process_data_manifest_entry`'s `into_file_scan_task().await -> DeleteFileIndex::get_deletes_for_data_file` call stuck on `notifier.notified().await` forever. The threshold is silent and exact: deadlock when inline_data_contexts.len() > concurrency_limit_manifest_files + concurrency_limit_manifest_entries Live-reproduced on sri-olly's V4 attribute_index_logs (concurrency=16): 1-hour probe (~23 inline entries, under 32): 115ms total load_index_ms=14 plan_ms=32 collect_ms=0 3-hour probe (66 inline entries, over 32): hangs to the executor's 10s INDEX_PROBE_TIMEOUT ceiling 100% of the time. Every attempt falls back to no-pruning, Grafana times out client-side. Fix is structural, not a knob. The inline spawn now drops inline_delete_tx the moment it finishes sending inline DELETE entries (which is the empty Vec on every append-only table -- the dominant case for V4-on-low-volume tables like attribute_index_*). With the tx gone, the delete channel closes immediately, the DeleteFileIndex populates with an empty set, and every `get_deletes_for_data_file` call returns Vec::new instead of blocking. The inline-data loop then proceeds with the data consumer unblocked. Mixed case (some inline delete entries + many inline data entries): order-preserving — deletes still send first, just don't outlive their loop. The new drop is between the loops, not before the delete loop. Inline deletes are extremely rare in practice (V4 inline-entries are the small-volume tail of fresh commits, which on append-only tables carry no deletes), so this path stays a no-op for the workloads it matters for. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

External tooling that copies parquet row groups verbatim (no decode/ re-encode) needs to construct iceberg::spec::DataFile entries purely from the output file's parquet footer. Tessellate's planned streaming row-group concat path is the immediate consumer — it eliminates the ~1 GB-per-file Arrow decode that today's compact_batch hits, allowing cycles to complete inside their k8s deadline. The function body is unchanged; only the visibility moves from pub(crate) to pub. The doc comment now references the consumer. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Per spec, multiple StatisticsFile entries may share the same snapshot_id as long as their statistics_path differs. The prior storage `HashMap<i64, StatisticsFile>` collapsed every second-and- later set_statistics() call for the same snapshot to last-write-wins (the index_statistics .rev() trick made it first-wins, same problem), silently dropping N-1 of every N entries before they ever reached the catalog. The b63983e action-layer fix changed action storage to Vec but TableMetadata still collapsed — the fix was incomplete and got reverted on main (cc16218) without diagnosis. This corrects it where the collapse actually lived. Storage: `HashMap<(i64, String), StatisticsFile>` keyed on the full spec key. Set N times with same snapshot_id but distinct paths → keeps all N. Set N times with the SAME (snapshot_id, path) → still upserts (correct: that's a true update). API: - `statistics_for_snapshot(snapshot_id) -> Option<&StatisticsFile>` kept for back-compat: returns the first match in HashMap iteration order (non-deterministic). Downstream code that only ever expected one entry continues to compile and read SOMETHING. - New `all_statistics_for_snapshot(snapshot_id) -> Vec<&StatisticsFile>` returns the full per-spec set. Callers (laminar's per-output merge sidecars, tessellate's future per-partition puffin path) should migrate to this. Builder: - `set_statistics`: inserts on composite key — append-not-overwrite for distinct paths, true update for same path. - `remove_statistics(snapshot_id)`: still snapshot-id-keyed (mirrors catalog protocol: RemoveStatistics is keyed by snapshot_id alone). Drops every entry matching the id via `retain`. Emits exactly one TableUpdate::RemoveStatistics iff anything was removed. Wire JSON unchanged: iceberg spec already represents `statistics` as a list. Old/new readers interop on JSON; only internal Rust storage shape changed. Tests: existing test_statistics (single-entry case) updated for new key shape; new test_set_multiple_statistics_same_snapshot_different_paths locks in the three-distinct-paths regression that motivated the fix. All 1092 iceberg lib tests pass. This is Phase 1 of B2 (multi-stats per snapshot end-to-end). Phase 2: Lakekeeper PG migration to add statistics_path to the table_statistics PK. Phase 3: Lakekeeper code rebuild. Phase 4: downstream rebuilds. Phase 5: tessellate refactor to write per- partition puffin during compact_batch instead of accumulating blobs in memory (closes the OOM root cause). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

…lumbing opendal 0.57 ships a native `credential_provider_chain` for S3 that covers IRSA, EKS Pod Identity, EC2 instance metadata, env vars, and shared-credentials files with auto-refresh — the same responsibility our hand-rolled `CustomAwsCredentialLoader` / `customized_credential_load` extension was filling under opendal 0.55. The chain is now built-in, so this iceberg-rust hook is dead weight and is removed. Changes in this commit: * Workspace `opendal = "0.55.0"` → `"0.57.0"`. The `Scheme` enum was removed in 0.57 (services moved to per-crate `opendal-service-*` modules with type-erased Configurator dispatch). `Storage::build` used `Scheme::S3`/`Scheme::Azdls`/... to dispatch construction; replace with a normalized scheme-string match on `scheme_str`. Aliases (`s3`/`s3a`, `abfs[s]`/`wasb[s]`, `gs`/`gcs`) are folded into the normalizer. * Delete `CustomAwsCredentialLoader` (the iceberg-rust wrapper that re-exposed `reqsign::AwsCredentialLoad`), the `Storage::S3 { customized_credential_load }` field, the `s3_config_build` parameter, and the `with_file_io_extension`-fed `extensions.get` lookup. Downstream consumers (executor) must drop their `FileIOBuilder::with_file_io_extension(CustomAwsCredentialLoader)` call and rely on opendal-native auth. * Drop the optional `reqsign` direct dep on the `iceberg` crate (`storage-s3 = ["opendal/services-s3", "reqsign"]` → just `["opendal/services-s3"]`). reqsign was only here to materialize `AwsCredential`/`AwsCredentialLoad` for the custom loader; with that loader gone, opendal carries reqsign internally and we don't surface it. Side-effects: * Storage-azdls (used by laminar/executor/tessellate on AKS) now compiles cleanly. The reqsign 0.16.5 federated-token `expires_on` parse panic (`parse 1782137016 into rfc3339 failed`) seen on the first ADLS write was an opendal-0.55-internal bug; opendal 0.57's azdls signer doesn't go through that path. * The `opendal::services::S3Config::allow_anonymous` field is deprecated in 0.57 (`skip_signature` is the replacement). Left in place for now — there's a separate cleanup pass for the `is_truthy` props bridge. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

opendal-service-azdls 0.57 builds a credential chain whose StaticEnv only carries the explicitly-set adls.* config keys (client_id, tenant_id, authority_host) — AZURE_FEDERATED_TOKEN_FILE is dropped on the floor, so reqsign's WorkloadIdentityCredentialProvider returns None on AKS and ADLS writes 403 AuthorizationPermissionMismatch (the IMDS fallback picks up the node-VM identity, wrong principal). Rather than fork opendal, run the federated → AAD exchange ourselves and wrap the operator's HttpClient via the public update_http_client hook on AccessorInfo. The wrapper caches the token until expiry minus 120s and only injects Authorization when the request doesn't already carry one, so static SAS / shared-key paths are untouched. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

reqsign's DefaultCredentialProvider chain runs IMDS before our wrap sees the request — on AKS that mints a token for the *node-VM* identity (wrong principal) and sets Authorization. The previous "if absent" check left that token in place, so writes kept failing 403 AuthorizationPermissionMismatch. Replace unconditionally instead; within the WI gate it's always the right thing to do. Also log at storage construction so we can see the WI path activate. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

ReplaceDataFiles carries a fixed delete-file list built at action creation time. When Transaction::commit retries on OCC rejection, it reloads the table but replays the original delete list against the new base. If another writer already replaced those files, the retry succeeds with stale references, causing both merged files to coexist in the live manifest (duplicate data). Add a disable_retry flag to Transaction, set automatically when ReplaceDataFilesAction is applied. The retry predicate checks this flag and skips retry, returning the OCC error to the caller. The caller (e.g. Laminar's merge-on-write) handles the failure by dropping the merge — the original small files remain in the table (committed by FastAppend), so no data is lost. Tested: 4 concurrent replace_data_files against MemoryCatalog — 1 succeeds, 3 correctly rejected.

update_snapshot_summaries was looking up self.snapshot_id (the new snapshot being created) in the metadata to find the parent summary. Since the new snapshot doesn't exist yet, the lookup returned None, and cumulative fields (total-records, total-data-files, etc.) were computed without a previous base — always showing only the current commit's counts instead of accumulating. Fix: use table_metadata.current_snapshot() directly as the previous snapshot, since that's the latest committed state.

Tests FastAppend contention, ReplaceDataFiles race, retry behavior, and two-replica ingest simulation against MemoryCatalog. Verifies no data loss and no duplicates under concurrent commits.

github-actions

license-eye has checked 390 files.

Valid	Invalid	Ignored	Fixed
319	2	69	0

Click to see the invalid file list

PARQUET_MANIFESTS.md
crates/iceberg/tests/occ_concurrency_test.rs

Use this command to fix any missing license headers

```bash

docker run -it --rm -v $(pwd):/github/workspace apache/skywalking-eyes header fix

</details>

github-actions · 2026-06-24T10:59:00Z

@@ -0,0 +1,665 @@
+//! Iceberg OCC Concurrency Test


Suggested change

//! Iceberg OCC Concurrency Test

// Licensed to the Apache Software Foundation (ASF) under one

// or more contributor license agreements. See the NOTICE file

// distributed with this work for additional information

// regarding copyright ownership. The ASF licenses this file

// to you under the Apache License, Version 2.0 (the

// "License"); you may not use this file except in compliance

// with the License. You may obtain a copy of the License at

//

// http://www.apache.org/licenses/LICENSE-2.0

//

// Unless required by applicable law or agreed to in writing,

// software distributed under the License is distributed on an

// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY

// KIND, either express or implied. See the License for the

// specific language governing permissions and limitations

// under the License.

//! Iceberg OCC Concurrency Test

sriv-e6x and others added 29 commits June 4, 2026 14:30

fix: CAS cache validation, content type separation, partition bounds …

5906380

…fallback

test: add OCC concurrency test for concurrent commit behavior

c59f15e

Tests FastAppend contention, ReplaceDataFiles race, retry behavior, and two-replica ingest simulation against MemoryCatalog. Verifies no data loss and no duplicates under concurrent commits.

github-actions Bot reviewed Jun 24, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

feat(transaction): with_snapshot_id() across actions, MergeAppend auto-compaction#8

feat(transaction): with_snapshot_id() across actions, MergeAppend auto-compaction#8
srinath-prabhu wants to merge 29 commits into
mainfrom
feat/replace-data-files-v3

srinath-prabhu commented Jun 4, 2026

Uh oh!

github-actions Bot left a comment

Uh oh!

github-actions Bot Jun 24, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

-//! Iceberg OCC Concurrency Test
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+//! Iceberg OCC Concurrency Test

Uh oh!

Conversation

srinath-prabhu commented Jun 4, 2026

Summary

⚠️ Important: reverted commits on main must be restored at merge time

Uh oh!

github-actions Bot left a comment

Choose a reason for hiding this comment

Uh oh!

github-actions Bot Jun 24, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants