fix: classify physical files by path context by liujiwen-up · Pull Request #341 · apache/paimon-rust

liujiwen-up · 2026-05-21T12:14:29Z

Purpose
Linked issue: close #xxx

Fix $physical_files_size over-counting non-data files as data files by replacing basename-only fallback classification with path-aware physical file classification.

Brief change log
Stop treating unknown physical files as data files.
Classify physical files by table-relative path:
manifest/manifest-, manifest/manifest-list-, manifest/index-manifest-* as manifest files.
statistics/* into manifest counters for current output schema compatibility.
index/* as index files.
partition-aware bucket-/ and bucket-postpone/* as data files.
Pass table partition depth from the DataFusion $physical_files_size system table.
Update $physical_files_size documentation to clarify that it is a diagnostic summary, not an orphan cleanup plan.
Tests
cargo fmt --check
cargo test -p paimon table::referenced_files::tests
cargo check -p paimon-datafusion
Added unit coverage for:

Root-level data-* not being counted as data.
Unknown files such as _SUCCESS, schema, snapshot, tag, and random files being ignored.
manifest-list-* counted as manifest.
statistics/* counted through compatible manifest counters.
Bucket data files counted even without a data-* prefix.
Partition-depth-aware bucket classification.
API and Format
This changes the Rust API signature of collect_physical_files_summary by adding partition_depth.

No storage format changes.

Documentation
Updated SQL system table docs for $physical_files_size to describe path-aware classification and clarify the diagnostic nature of the summary.

JingsongLi

LGTM — solid improvement over the previous basename-only heuristic.

Highlights:

The path-aware classification logic (classify_physical_path) properly uses table-relative paths and partition depth to avoid false positives. Unknown files are now correctly ignored rather than being counted as data files.
Good separation of concerns: is_bucket_dir_name, is_partition_segment, is_data_file_in_bucket are well-named and independently testable.
The test coverage is comprehensive — unpartitioned tables, partitioned tables, edge cases like bucket-postpone, root-level files, and nested directories that shouldn't match.
Documentation update accurately reflects the new behavior and clarifies the diagnostic nature of the summary.

One question: is_partition_segment checks for key=value format but doesn't validate against actual table partition keys. Is that intentional? For tables with data-file.path-directory set to a custom directory that happens to contain = in its name, this could theoretically misclassify. Probably an extreme edge case not worth the complexity though.

fix: classify physical files by path context

4a3a005

JingsongLi approved these changes May 24, 2026

View reviewed changes

JingsongLi merged commit c954893 into apache:main May 24, 2026
8 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix: classify physical files by path context#341

fix: classify physical files by path context#341
JingsongLi merged 1 commit into
apache:mainfrom
liujiwen-up:phase1-referenced-files-path-aware

liujiwen-up commented May 21, 2026

Uh oh!

JingsongLi left a comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

liujiwen-up commented May 21, 2026

Uh oh!

JingsongLi left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants