Fix parquet loading crash from datasets version mismatch #1140
yeyu-nvidia wants to merge 3 commits into main from
Conversation
When local parquet files contain HF datasets metadata written by a
different version of the `datasets` library, `load_dataset("parquet")`
can raise a TypeError during feature deserialization. Fall back to
reading via PyArrow directly in that case.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Ye Yu <yeyu@nvidia.com>
📝 Walkthrough

Added a try/except around parquet dataset loading in SPEEDBench._load_dataset; on TypeError the code falls back to reading parquet files with pyarrow, optionally strips Arrow schema metadata, concatenates tables, and constructs a HuggingFace Dataset from the resulting Arrow table.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Caller
    participant SPEEDBench
    participant DatasetsLib as "datasets.load_dataset"
    participant PyArrow as "pyarrow.parquet"
    participant HF_Dataset as "datasets.Dataset"

    Caller->>SPEEDBench: request dataset load
    SPEEDBench->>DatasetsLib: load_dataset("parquet", data_files, split="test")
    alt success
        DatasetsLib-->>SPEEDBench: Dataset
        SPEEDBench-->>Caller: return Dataset (possibly truncated)
    else TypeError
        DatasetsLib--xSPEEDBench: raises TypeError
        SPEEDBench->>PyArrow: read_table(file1), read_table(fileN)
        PyArrow-->>SPEEDBench: Table(s)
        SPEEDBench->>PyArrow: concat_tables(tables)
        PyArrow-->>SPEEDBench: concatenated Table
        SPEEDBench->>SPEEDBench: strip b"huggingface" metadata if present
        SPEEDBench->>HF_Dataset: Dataset(concatenated Table)
        HF_Dataset-->>SPEEDBench: Dataset
        SPEEDBench-->>Caller: return Dataset (possibly truncated)
    end
```
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@examples/specdec_bench/specdec_bench/datasets/speed.py`:
- Around line 721-730: The except clause currently catching TypeError around the
parquet fallback (the block that imports pyarrow, pq.read_table,
pyarrow.concat_tables and constructs HFDataset(table) from data_files["test"])
should be changed so the fallback actually runs for Hugging Face metadata
incompatibility errors: replace `except TypeError:` with `except ValueError:`
(or `except (TypeError, ValueError):` if you want to handle both) so the
PyArrow-to-HFDataset fallback triggers when datasets raises the ValueError about
unknown feature types.
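The recommendation above amounts to widening the caught exception tuple around the primary loader. A minimal, library-agnostic sketch of that pattern; `broken_primary` and `pyarrow_fallback` are hypothetical stand-ins for `load_dataset` and the PyArrow path, not names from the PR:

```python
def load_with_fallback(primary, fallback):
    """Try the primary loader; on the known incompatibility errors
    (TypeError in some cases, ValueError from the feature-type lookup),
    fall through to the PyArrow-style fallback."""
    try:
        return primary()
    except (TypeError, ValueError):
        return fallback()

# Stand-in for load_dataset raising the metadata-incompatibility error
def broken_primary():
    raise ValueError("Feature type 'List' not found.")

def pyarrow_fallback():
    return "dataset-from-pyarrow"

print(load_with_fallback(broken_primary, pyarrow_fallback))  # → dataset-from-pyarrow
```

Catching both types keeps the fallback working across `datasets` versions that fail at different points in feature deserialization.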
```python
except TypeError:
    # Fallback: parquet metadata may be incompatible with the installed
    # ``datasets`` version. Read via PyArrow and convert directly.
    import pyarrow
    import pyarrow.parquet as pq
    from datasets import Dataset as HFDataset

    tables = [pq.read_table(f) for f in data_files["test"]]
    table = pyarrow.concat_tables(tables) if len(tables) > 1 else tables[0]
    dataset = HFDataset(table)
```
🧩 Analysis chain
🌐 Web query:
What exact exception types/messages does Hugging Face `datasets` raise when loading Parquet files whose embedded `features` metadata is incompatible across `datasets` versions?
💡 Result:
When `datasets` reads a Parquet file that contains embedded 🤗 Datasets schema metadata (stored under the Parquet/Arrow schema metadata key `b"huggingface"`), it reconstructs `Features` from the serialized dict. If that dict contains a feature `_type` that doesn't exist in the installed `datasets` version, the load fails with:
- Exception type: `ValueError`
- Exception message (verbatim example seen when loading Parquet written by `datasets>=4.0.0`, which embeds `_type: "List"`, using `datasets==3.6.0`):

  `ValueError: Feature type 'List' not found. Available feature types: ['Value', 'ClassLabel', 'Translation', 'TranslationVariableLanguages', 'LargeList', 'Sequence', 'Array2D', 'Array3D', 'Array4D', 'Array5D', 'Audio', 'Image', 'Video', 'Pdf']` [1][2]

This is raised from `datasets.features.features.generate_from_dict(...)` when `_FEATURE_TYPES` (in that installed version) has no entry for the serialized `_type` string. [1]
Sources: [1] huggingface/datasets#7676 [2] verl-project/verl#2723
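To make the failure mode concrete, here is a small simulation of that lookup; the registry contents and function name are illustrative stand-ins, not the real `_FEATURE_TYPES` of any particular `datasets` release:

```python
# Illustrative stand-in for datasets.features.features._FEATURE_TYPES
_FEATURE_TYPES = {"Value", "ClassLabel", "Sequence", "LargeList"}

def generate_from_dict_sketch(obj):
    """Mimics how a serialized feature dict is resolved: an unknown
    '_type' string raises ValueError, not TypeError."""
    feature_type = obj["_type"]
    if feature_type not in _FEATURE_TYPES:
        raise ValueError(
            f"Feature type '{feature_type}' not found. "
            f"Available feature types: {sorted(_FEATURE_TYPES)}"
        )
    return feature_type

# A parquet file written by datasets>=4.0.0 embeds _type: "List",
# which an older registry does not know about:
try:
    generate_from_dict_sketch({"_type": "List"})
except ValueError as e:
    print(type(e).__name__)  # → ValueError
```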
🏁 Script executed:

```shell
# Find the full context of the affected function to understand the intent
fd -t f speed.py
```

Repository: NVIDIA/Model-Optimizer
Length of output: 167

🏁 Script executed:

```shell
# Read the file around lines 710-740 to understand the full context
cat -n examples/specdec_bench/specdec_bench/datasets/speed.py | sed -n '710,740p'
```

Repository: NVIDIA/Model-Optimizer
Length of output: 1672

🏁 Script executed:

```shell
# Check if there are any comments or tests that clarify what TypeError is expected
rg -B5 -A10 'except TypeError' examples/specdec_bench/specdec_bench/datasets/speed.py
```

Repository: NVIDIA/Model-Optimizer
Length of output: 857
Change caught exception type from `TypeError` to `ValueError` to match the actual metadata incompatibility error from `datasets`.

At line 721, the code documents a fallback for parquet metadata incompatibility but catches `TypeError`. However, Hugging Face `datasets` raises `ValueError` when loading a Parquet file with embedded feature metadata (`_type` key) not available in the installed version. Example error: "Feature type 'List' not found. Available feature types: [...]" raised from `datasets.features.features.generate_from_dict(...)`.

As written, the PyArrow fallback would never execute for the documented use case. Change `except TypeError:` to `except ValueError:` to properly trigger the fallback for metadata incompatibility, or clarify what `TypeError` scenario the current catch is intended to handle.
Codecov Report

✅ All modified and coverable lines are covered by tests.

Additional details and impacted files:

```
@@            Coverage Diff             @@
##             main    #1140      +/-   ##
==========================================
+ Coverage   70.14%   70.19%   +0.04%
==========================================
  Files         230      230
  Lines       26053    26073      +20
==========================================
+ Hits        18276    18302      +26
+ Misses       7777     7771       -6
```

☔ View full report in Codecov by Sentry.
The PyArrow fallback still failed because `HFDataset(table)` parses the `huggingface` metadata embedded in the Arrow schema, hitting the same `TypeError`. Strip that metadata before constructing the Dataset. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: Ye Yu <yeyu@nvidia.com>
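The stripping step described in this commit amounts to filtering one key out of the Arrow schema's metadata mapping (a bytes-to-bytes dict, or `None`) before the table is handed to `Dataset`. A sketch of just that filtering, assuming pyarrow's convention that `None` means "no schema metadata"; the helper name is hypothetical:

```python
def strip_hf_metadata(metadata):
    """Drop the b'huggingface' key, which embeds version-specific
    Features JSON, from an Arrow schema metadata mapping."""
    meta = metadata or {}
    clean = {k: v for k, v in meta.items() if k != b"huggingface"}
    return clean or None

# In the real fallback the result would feed table.replace_schema_metadata(...)
print(strip_hf_metadata({b"huggingface": b"{}", b"pandas": b"{}"}))  # → {b'pandas': b'{}'}
print(strip_hf_metadata({b"huggingface": b"{}"}))  # → None
```

Returning `None` when nothing remains matters because an empty dict and `None` are distinct schema-metadata states in Arrow.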
The tensorrt_llm 1.3.0rc5 container pins `datasets==3.1.0`. The previous pin (`>=4.4.0`) caused concurrent pip installs across ranks to race and corrupt the `datasets` package, breaking `tensorrt_llm` imports entirely. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: Ye Yu <yeyu@nvidia.com>
Summary
- When local parquet files contain HF `datasets` metadata written by a different library version, `load_dataset("parquet")` raises a `TypeError` during feature deserialization
- Catch the `TypeError` and read the parquet files directly via PyArrow, bypassing the incompatible metadata

Test plan
- Ran `specdec_bench` with EAGLE config against local parquet dataset files
- Verified the normal `load_dataset` path

🤖 Generated with Claude Code