Fix parquet loading crash from datasets version mismatch #1140
yeyu-nvidia wants to merge 3 commits into main from
Conversation
When local parquet files contain HF datasets metadata written by a
different version of the `datasets` library, `load_dataset("parquet")`
can raise a TypeError during feature deserialization. Fall back to
reading via PyArrow directly in that case.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Ye Yu <yeyu@nvidia.com>
📝 Walkthrough

Added a try/except around parquet dataset loading in SPEEDBench._load_dataset; on TypeError the code falls back to reading parquet files with pyarrow, optionally strips Arrow schema metadata, concatenates tables, and constructs a HuggingFace Dataset from the resulting Arrow table.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Caller
    participant SPEEDBench
    participant DatasetsLib as "datasets.load_dataset"
    participant PyArrow as "pyarrow.parquet"
    participant HF_Dataset as "datasets.Dataset"

    Caller->>SPEEDBench: request dataset load
    SPEEDBench->>DatasetsLib: load_dataset("parquet", data_files, split="test")
    alt success
        DatasetsLib-->>SPEEDBench: Dataset
        SPEEDBench-->>Caller: return Dataset (possibly truncated)
    else TypeError
        DatasetsLib--xSPEEDBench: raises TypeError
        SPEEDBench->>PyArrow: read_table(file1), read_table(fileN)
        PyArrow-->>SPEEDBench: Table(s)
        SPEEDBench->>PyArrow: concat_tables(tables)
        PyArrow-->>SPEEDBench: concatenated Table
        SPEEDBench->>SPEEDBench: strip b"huggingface" metadata if present
        SPEEDBench->>HF_Dataset: Dataset(concatenated Table)
        HF_Dataset-->>SPEEDBench: Dataset
        SPEEDBench-->>Caller: return Dataset (possibly truncated)
    end
```
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@examples/specdec_bench/specdec_bench/datasets/speed.py`:
- Around line 721-730: The except clause currently catching TypeError around the
parquet fallback (the block that imports pyarrow, pq.read_table,
pyarrow.concat_tables and constructs HFDataset(table) from data_files["test"])
should be changed so the fallback actually runs for Hugging Face metadata
incompatibility errors: replace `except TypeError:` with `except ValueError:`
(or `except (TypeError, ValueError):` if you want to handle both) so the
PyArrow-to-HFDataset fallback triggers when datasets raises the ValueError about
unknown feature types.
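The recommendation above amounts to widening the caught exception tuple around the primary loader. A minimal, library-agnostic sketch of that pattern; `broken_primary` and `pyarrow_fallback` are hypothetical stand-ins for `load_dataset` and the PyArrow path, not names from the PR:

```python
def load_with_fallback(primary, fallback):
    """Try the primary loader; on the known incompatibility errors
    (TypeError in some cases, ValueError from the feature-type lookup),
    fall through to the PyArrow-style fallback."""
    try:
        return primary()
    except (TypeError, ValueError):
        return fallback()

# Stand-in for load_dataset raising the metadata-incompatibility error
def broken_primary():
    raise ValueError("Feature type 'List' not found.")

def pyarrow_fallback():
    return "dataset-from-pyarrow"

print(load_with_fallback(broken_primary, pyarrow_fallback))  # → dataset-from-pyarrow
```

Catching both types keeps the fallback working across `datasets` versions that fail at different points in feature deserialization.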
```python
except TypeError:
    # Fallback: parquet metadata may be incompatible with the installed
    # ``datasets`` version. Read via PyArrow and convert directly.
    import pyarrow
    import pyarrow.parquet as pq
    from datasets import Dataset as HFDataset

    tables = [pq.read_table(f) for f in data_files["test"]]
    table = pyarrow.concat_tables(tables) if len(tables) > 1 else tables[0]
    dataset = HFDataset(table)
```
🧩 Analysis chain
🌐 Web query:
What exact exception types/messages does Hugging Face `datasets` raise when loading Parquet files whose embedded `features` metadata is incompatible across `datasets` versions?
💡 Result:
When `datasets` reads a Parquet file that contains embedded 🤗 Datasets schema metadata (stored under the Parquet/Arrow schema metadata key `b"huggingface"`), it reconstructs `Features` from the serialized dict. If that dict contains a feature `_type` that doesn't exist in the installed `datasets` version, the load fails with:
- Exception type: `ValueError`
- Exception message (verbatim example seen when loading Parquet written by `datasets>=4.0.0`, which embeds `_type: "List"`, using `datasets==3.6.0`):

  `ValueError: Feature type 'List' not found. Available feature types: ['Value', 'ClassLabel', 'Translation', 'TranslationVariableLanguages', 'LargeList', 'Sequence', 'Array2D', 'Array3D', 'Array4D', 'Array5D', 'Audio', 'Image', 'Video', 'Pdf']` [1][2]

This is raised from `datasets.features.features.generate_from_dict(...)` when `_FEATURE_TYPES` (in that installed version) has no entry for the serialized `_type` string. [1]
Sources: [1] huggingface/datasets#7676 [2] verl-project/verl#2723
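To make the failure mode concrete, here is a small simulation of that lookup; the registry contents and function name are illustrative stand-ins, not the real `_FEATURE_TYPES` of any particular `datasets` release:

```python
# Illustrative stand-in for datasets.features.features._FEATURE_TYPES
_FEATURE_TYPES = {"Value", "ClassLabel", "Sequence", "LargeList"}

def generate_from_dict_sketch(obj):
    """Mimics how a serialized feature dict is resolved: an unknown
    '_type' string raises ValueError, not TypeError."""
    feature_type = obj["_type"]
    if feature_type not in _FEATURE_TYPES:
        raise ValueError(
            f"Feature type '{feature_type}' not found. "
            f"Available feature types: {sorted(_FEATURE_TYPES)}"
        )
    return feature_type

# A parquet file written by datasets>=4.0.0 embeds _type: "List",
# which an older registry does not know about:
try:
    generate_from_dict_sketch({"_type": "List"})
except ValueError as e:
    print(type(e).__name__)  # → ValueError
```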
🏁 Script executed:

```shell
# Find the full context of the affected function to understand the intent
fd -t f speed.py
```

Repository: NVIDIA/Model-Optimizer
Length of output: 167

🏁 Script executed:

```shell
# Read the file around lines 710-740 to understand the full context
cat -n examples/specdec_bench/specdec_bench/datasets/speed.py | sed -n '710,740p'
```

Repository: NVIDIA/Model-Optimizer
Length of output: 1672

🏁 Script executed:

```shell
# Check if there are any comments or tests that clarify what TypeError is expected
rg -B5 -A10 'except TypeError' examples/specdec_bench/specdec_bench/datasets/speed.py
```

Repository: NVIDIA/Model-Optimizer
Length of output: 857
Change caught exception type from `TypeError` to `ValueError` to match the actual metadata incompatibility error from `datasets`.

At line 721, the code documents a fallback for parquet metadata incompatibility but catches `TypeError`. However, Hugging Face `datasets` raises `ValueError` when loading a Parquet file with embedded feature metadata (`_type` key) not available in the installed version. Example error: "Feature type 'List' not found. Available feature types: [...]" raised from `datasets.features.features.generate_from_dict(...)`.

As written, the PyArrow fallback would never execute for the documented use case. Change `except TypeError:` to `except ValueError:` to properly trigger the fallback for metadata incompatibility, or clarify what `TypeError` scenario the current catch is intended to handle.
Codecov Report

✅ All modified and coverable lines are covered by tests.

Additional details and impacted files:

```
@@            Coverage Diff             @@
##             main    #1140      +/-   ##
==========================================
+ Coverage   70.14%   70.19%   +0.04%
==========================================
  Files         230      230
  Lines       26053    26073      +20
==========================================
+ Hits        18276    18302      +26
+ Misses       7777     7771       -6
```

☔ View full report in Codecov by Sentry.
The PyArrow fallback still failed because `HFDataset(table)` parses the `huggingface` metadata embedded in the Arrow schema, hitting the same `TypeError`. Strip that metadata before constructing the Dataset. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: Ye Yu <yeyu@nvidia.com>
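The stripping step described in this commit amounts to filtering one key out of the Arrow schema's metadata mapping (a bytes-to-bytes dict, or `None`) before the table is handed to `Dataset`. A sketch of just that filtering, assuming pyarrow's convention that `None` means "no schema metadata"; the helper name is hypothetical:

```python
def strip_hf_metadata(metadata):
    """Drop the b'huggingface' key, which embeds version-specific
    Features JSON, from an Arrow schema metadata mapping."""
    meta = metadata or {}
    clean = {k: v for k, v in meta.items() if k != b"huggingface"}
    return clean or None

# In the real fallback the result would feed table.replace_schema_metadata(...)
print(strip_hf_metadata({b"huggingface": b"{}", b"pandas": b"{}"}))  # → {b'pandas': b'{}'}
print(strip_hf_metadata({b"huggingface": b"{}"}))  # → None
```

Returning `None` when nothing remains matters because an empty dict and `None` are distinct schema-metadata states in Arrow.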
The tensorrt_llm 1.3.0rc5 container pins `datasets==3.1.0`. The previous pin (`>=4.4.0`) caused concurrent pip installs across ranks to race and corrupt the `datasets` package, breaking `tensorrt_llm` imports entirely. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: Ye Yu <yeyu@nvidia.com>
Summary
- When local parquet files contain HF `datasets` metadata written by a different library version, `load_dataset("parquet")` raises a `TypeError` during feature deserialization
- Catch the `TypeError` and read the parquet files directly via PyArrow, bypassing the incompatible metadata

Test plan
- Ran `specdec_bench` with EAGLE config against local parquet dataset files
- Verified the normal `load_dataset` path

🤖 Generated with Claude Code