Skip to content

fix: classify physical files by path context#341

Merged
JingsongLi merged 1 commit into
apache:mainfrom
liujiwen-up:phase1-referenced-files-path-aware
May 24, 2026
Merged

fix: classify physical files by path context#341
JingsongLi merged 1 commit into
apache:mainfrom
liujiwen-up:phase1-referenced-files-path-aware

Conversation

@liujiwen-up
Copy link
Copy Markdown
Contributor

Purpose
Linked issue: close #xxx

Fix $physical_files_size over-counting non-data files as data files by replacing basename-only fallback classification with path-aware physical file classification.

Brief change log
Stop treating unknown physical files as data files.
Classify physical files by table-relative path:
manifest/manifest-, manifest/manifest-list-, manifest/index-manifest-* as manifest files.
statistics/* into manifest counters for current output schema compatibility.
index/* as index files.
partition-aware bucket-/ and bucket-postpone/* as data files.
Pass table partition depth from the DataFusion $physical_files_size system table.
Update $physical_files_size documentation to clarify that it is a diagnostic summary, not an orphan cleanup plan.
Tests
cargo fmt --check
cargo test -p paimon table::referenced_files::tests
cargo check -p paimon-datafusion
Added unit coverage for:

Root-level data-* not being counted as data.
Unknown files such as _SUCCESS, schema, snapshot, tag, and random files being ignored.
manifest-list-* counted as manifest.
statistics/* counted through compatible manifest counters.
Bucket data files counted even without a data-* prefix.
Partition-depth-aware bucket classification.
API and Format
This changes the Rust API signature of collect_physical_files_summary by adding partition_depth.

No storage format changes.

Documentation
Updated SQL system table docs for $physical_files_size to describe path-aware classification and clarify the diagnostic nature of the summary.

Copy link
Copy Markdown
Contributor

@JingsongLi JingsongLi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM — solid improvement over the previous basename-only heuristic.

Highlights:

  1. The path-aware classification logic (classify_physical_path) properly uses table-relative paths and partition depth to avoid false positives. Unknown files are now correctly ignored rather than being counted as data files.

  2. Good separation of concerns: is_bucket_dir_name, is_partition_segment, is_data_file_in_bucket are well-named and independently testable.

  3. The test coverage is comprehensive — unpartitioned tables, partitioned tables, edge cases like bucket-postpone, root-level files, and nested directories that shouldn't match.

  4. Documentation update accurately reflects the new behavior and clarifies the diagnostic nature of the summary.

One question: is_partition_segment checks for key=value format but doesn't validate against actual table partition keys. Is that intentional? For tables with data-file.path-directory set to a custom directory that happens to contain = in its name, this could theoretically misclassify. Probably an extreme edge case not worth the complexity though.

@JingsongLi JingsongLi merged commit c954893 into apache:main May 24, 2026
8 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants