fix: classify physical files by path context#341
Conversation
JingsongLi
left a comment
There was a problem hiding this comment.
LGTM — solid improvement over the previous basename-only heuristic.
Highlights:
-
The path-aware classification logic (
classify_physical_path) properly uses table-relative paths and partition depth to avoid false positives. Unknown files are now correctly ignored rather than being counted as data files. -
Good separation of concerns:
is_bucket_dir_name,is_partition_segment,is_data_file_in_bucketare well-named and independently testable. -
The test coverage is comprehensive — unpartitioned tables, partitioned tables, edge cases like
bucket-postpone, root-level files, and nested directories that shouldn't match. -
Documentation update accurately reflects the new behavior and clarifies the diagnostic nature of the summary.
One question: is_partition_segment checks for key=value format but doesn't validate against actual table partition keys. Is that intentional? For tables with data-file.path-directory set to a custom directory that happens to contain = in its name, this could theoretically misclassify. Probably an extreme edge case not worth the complexity though.
Purpose
Linked issue: close #xxx
Fix $physical_files_size over-counting non-data files as data files by replacing basename-only fallback classification with path-aware physical file classification.
Brief change log
Stop treating unknown physical files as data files.
Classify physical files by table-relative path:
manifest/manifest-, manifest/manifest-list-, manifest/index-manifest-* as manifest files.
statistics/* into manifest counters for current output schema compatibility.
index/* as index files.
partition-aware bucket-/ and bucket-postpone/* as data files.
Pass table partition depth from the DataFusion $physical_files_size system table.
Update $physical_files_size documentation to clarify that it is a diagnostic summary, not an orphan cleanup plan.
Tests
cargo fmt --check
cargo test -p paimon table::referenced_files::tests
cargo check -p paimon-datafusion
Added unit coverage for:
Root-level data-* not being counted as data.
Unknown files such as _SUCCESS, schema, snapshot, tag, and random files being ignored.
manifest-list-* counted as manifest.
statistics/* counted through compatible manifest counters.
Bucket data files counted even without a data-* prefix.
Partition-depth-aware bucket classification.
API and Format
This changes the Rust API signature of collect_physical_files_summary by adding partition_depth.
No storage format changes.
Documentation
Updated SQL system table docs for $physical_files_size to describe path-aware classification and clarify the diagnostic nature of the summary.