feat: add merged output format and parquet directory support #180

Open: BioGeek wants to merge 6 commits into main from feat-merged-predict-output

Conversation


BioGeek (Contributor) commented Apr 10, 2026

Summary

Add two features needed for INFlow pipeline integration:

1. Merged output format for winnow predict

  • New output_format config option in predict.yaml: "split" (default, legacy) or "merged"
  • When "merged", writes a single calibrated_psms.tsv with all columns (spectrum metadata + predictions + FDR metrics) instead of splitting into two CSVs
  • Legacy split output (metadata.csv + preds_and_fdr_metrics.csv) remains the default — no breaking change
  • Usage: winnow predict output_format=merged

2. Parquet directory support in InstaNovoDatasetLoader

  • _load_spectrum_data() now accepts a directory path containing .parquet files
  • Reads and concatenates all .parquet files in the directory
  • Supports InstaNovo's sharded parquet output (dataset-*-test-*.parquet)
  • Raises ValueError for empty directories
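As a rough illustration of feature 1, the split/merged dispatch could look like the sketch below. The function name, column handling, and file layout are hypothetical stand-ins for whatever winnow actually does internally; only the output file names and the config values come from this PR.

```python
from pathlib import Path

import pandas as pd


def write_predictions(metadata: pd.DataFrame, preds: pd.DataFrame,
                      out_dir: str, output_format: str = "split") -> None:
    """Write calibrated PSM output in either 'split' or 'merged' format.

    Sketch only: `write_predictions` is a hypothetical name, and the real
    winnow code may structure this differently.
    """
    out = Path(out_dir)
    if output_format == "merged":
        # One row per spectrum: metadata columns followed by prediction
        # and FDR-metric columns, written as a single TSV.
        merged = pd.concat(
            [metadata.reset_index(drop=True), preds.reset_index(drop=True)],
            axis=1,
        )
        merged.to_csv(out / "calibrated_psms.tsv", sep="\t", index=False)
    elif output_format == "split":
        # Legacy behaviour: two separate CSVs.
        metadata.to_csv(out / "metadata.csv", index=False)
        preds.to_csv(out / "preds_and_fdr_metrics.csv", index=False)
    else:
        raise ValueError(f"Unknown output_format: {output_format!r}")
```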

Motivation

The INFlow Nextflow pipeline needs:

  • A single canonical calibrated PSM table to pass between processes (merged output)
  • Ability to read InstaNovo's native parquet directory output (sharded parquet files)
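The directory-loading behaviour described for `_load_spectrum_data()` amounts to globbing the shards and concatenating them. A minimal sketch under that assumption (the real method accepts more input formats and does further processing):

```python
from pathlib import Path

import pandas as pd


def load_spectrum_data(path: str) -> pd.DataFrame:
    """Load spectrum data from a single file or a directory of .parquet shards.

    Hedged sketch of the behaviour described above for _load_spectrum_data.
    """
    p = Path(path)
    if p.is_dir():
        # InstaNovo writes sharded output such as dataset-*-test-*.parquet;
        # read every shard and concatenate into one frame.
        shards = sorted(p.glob("*.parquet"))
        if not shards:
            raise ValueError(f"No .parquet files found in directory: {p}")
        return pd.concat((pd.read_parquet(s) for s in shards),
                         ignore_index=True)
    return pd.read_parquet(p)
```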

Test plan

  • 5 new tests in tests/scripts/test_predict_output.py
  • All 299 tests pass (294 existing + 5 new)
  • Pre-commit hooks pass (flake8, ruff, mypy, etc.)

🤖 Generated with Claude Code


github-actions Bot commented Apr 10, 2026

Coverage Report

File                                 Stmts  Miss  Cover  Missing
__init__.py                              0     0   100%
data_types.py                            4     0   100%
calibration/__init__.py                  0     0   100%
calibration/calibration_features.py    316     7    97%  247–248, 445, 734, 922, 926, 1224
calibration/calibrator.py               91    15    83%  69–70, 72, 106–109, 134–135, 137, 162–163, 167, 194–195
compat/__init__.py                       0     0   100%
compat/instanovo.py                     10     6    40%  12, 14–15, 17, 24–25
datasets/__init__.py                     0     0   100%
datasets/calibration_dataset.py        109    17    84%  155, 169, 171, 173, 183, 196, 249, 251–252, 258–261, 263–266
datasets/data_loaders.py               275    14    94%  23, 189, 220–221, 423, 466, 858, 862, 911, 922, 1036–1037, 1073–1074
datasets/interfaces.py                   3     0   100%
datasets/psm_dataset.py                 25     0   100%
fdr/__init__.py                          0     0   100%
fdr/base.py                             58    15    74%  81, 85–86, 91, 98–99, 105, 126, 129–130, 135, 137–138, 144, 186
fdr/database_grounded.py                28     1    96%  52
fdr/nonparametric.py                    25     4    84%  62, 68–69, 72
scripts/__init__.py                      0     0   100%
scripts/main.py                        192   150    21%  17–20, 53, 55–56, 68, 76, 86, 88–90, 92, 94–99, 104–105, 138–140, 161–163, 166, 169, 174, 176–178, 180, 182–183, 186–187, 190, 192–193, 195, 197, 199–200, 202, 205–206, 209–210, 213–214, 217–219, 221, 238–240, 242, 244, 249, 251–253, 255–256, 258–260, 265–266, 268–270, 272, 274, 276–277, 281–284, 286–287, 289–290, 292–293, 295, 312–314, 317, 320, 325, 327–329, 331–333, 335–336, 339–340, 343, 345–346, 348, 350, 352–353, 355, 358–359, 365–366, 369–370, 373–374, 377–378, 386–387, 389, 391, 393, 395, 397–399, 402, 408, 411, 415, 454–455, 492–493, 534–535, 565–566, 593–594, 624–625
utils/__init__.py                        4     0   100%
utils/config_formatter.py               53    40    24%  29, 37–38, 40–42, 44, 55, 58–60, 62–63, 66–69, 72–74, 77–78, 80, 91, 102, 113, 127–128, 130–132, 145–147, 150, 153–154, 157–158, 160
utils/config_path.py                    76     5    93%  24–26, 117–118
utils/peptide.py                        16     0   100%
TOTAL                                 1285   274    78%

Tests  Skipped  Failures  Errors  Time
299    0 💤     0 ❌      0 🔥    36.882s ⏱️

@BioGeek BioGeek requested a review from JemmaLDaniel April 10, 2026 13:59
BioGeek and others added 2 commits April 10, 2026 23:33
winnow.utils was added in 7330c0c but not registered in the packages
list in pyproject.toml, causing ModuleNotFoundError when installed
via pip from the git repo.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
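The pyproject fix above would look roughly like the fragment below, assuming a setuptools-style explicit package list (the project's actual build backend and package names are assumptions here, not taken from the diff):

```toml
# pyproject.toml (sketch)
[tool.setuptools]
packages = [
    "winnow",
    "winnow.calibration",
    "winnow.datasets",
    "winnow.utils",   # previously missing, causing ModuleNotFoundError on pip install
]
```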
When spectrum data contains a ground-truth 'sequence' column with
modification notations not in residue_remapping (e.g. C+57.021),
_process_spectrum_data would crash with ConfigKeyError.

Now catches the remapping error, logs a warning, and drops the
'sequence' column so the pipeline can continue without evaluation
metrics. This is the correct behavior for de novo mode where ground
truth sequences are optional.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

BioGeek commented Apr 10, 2026

Additional fixes added to this PR

3. Include winnow.utils in package distribution (3894cc6)

winnow.utils was added to main but not registered in pyproject.toml packages list, causing ModuleNotFoundError when installed via pip. Also affects main branch.

4. Gracefully handle unmappable residues in ground-truth sequences (c0c9919)

When spectrum data contains a sequence column with modification notations not in residue_remapping (e.g. C+57.021), _process_spectrum_data crashed with ConfigKeyError. Now catches the error, logs a warning, and drops the column so the pipeline continues. Correct for de novo mode.

All 299 tests pass.
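Fix 4 boils down to catching the remapping failure, warning, and dropping the column. The sketch below illustrates that shape with hypothetical names; in particular, the `KeyError` on an unmapped `+`-style modification stands in for the real `ConfigKeyError` raised by the config lookup:

```python
import logging

import pandas as pd

logger = logging.getLogger(__name__)


def remap_sequences(df: pd.DataFrame,
                    residue_remapping: dict[str, str]) -> pd.DataFrame:
    """Apply residue remapping to the ground-truth 'sequence' column.

    Sketch of the fix described above: if a sequence contains a modification
    notation that residue_remapping does not cover (e.g. 'C+57.021'), warn
    and drop the column instead of crashing, so de novo runs without usable
    ground truth can continue.
    """
    if "sequence" not in df.columns:
        return df

    def remap(seq: str) -> str:
        out = seq
        for src, dst in residue_remapping.items():
            out = out.replace(src, dst)
        # Hypothetical strictness check standing in for the real
        # ConfigKeyError raised when a notation is not in the mapping:
        if "+" in out:
            raise KeyError(f"Unmapped modification in sequence: {seq}")
        return out

    try:
        df = df.assign(sequence=df["sequence"].map(remap))
    except KeyError as err:
        logger.warning("Dropping 'sequence' column: %s", err)
        df = df.drop(columns=["sequence"])
    return df
```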

BioGeek and others added 2 commits April 14, 2026 12:08
Adds winnow.__version__ by reading the installed version from
importlib.metadata, keeping it in sync with pyproject.toml.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
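Reading the installed version from `importlib.metadata` is a standard pattern; the commit likely looks something like this sketch (the fallback branch is an assumption, not confirmed by the diff):

```python
# Sketch of winnow/__init__.py version handling via importlib.metadata,
# which reads the version recorded at install time from pyproject.toml.
from importlib.metadata import PackageNotFoundError, version

try:
    __version__ = version("winnow")
except PackageNotFoundError:
    # Package not installed, e.g. running from a source checkout.
    __version__ = "0.0.0+unknown"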
Accept both .csv and .parquet files in _load_predictions_without_beams
and _load_beam_preds, dispatching on file extension. This enables
faster I/O when InstaNovo outputs predictions in Parquet format.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

BioGeek commented Apr 14, 2026

Added Parquet input support for predictions (1174cd9).

Both _load_predictions_without_beams and _load_beam_preds now accept .parquet files in addition to .csv, dispatching on file extension.

Motivation: The INFlow pipeline passes predictions from InstaNovo to Winnow. With the corresponding InstaNovo change (instadeepai/InstaNovo-internal#574), predictions can be saved as Parquet for faster I/O. This change lets Winnow read them without conversion.

Also previously pushed: winnow.__version__ support via importlib.metadata (9d898b2).

No breaking changes — existing CSV workflows are unaffected.
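The extension dispatch described here is straightforward; a hedged sketch (the real `_load_predictions_without_beams` and `_load_beam_preds` do further column handling after loading):

```python
from pathlib import Path

import pandas as pd


def read_predictions(path: str) -> pd.DataFrame:
    """Load a predictions table, dispatching on file extension.

    Sketch of the CSV/Parquet dispatch added in this PR; `read_predictions`
    is a hypothetical stand-in name.
    """
    suffix = Path(path).suffix.lower()
    if suffix == ".csv":
        return pd.read_csv(path)
    if suffix == ".parquet":
        return pd.read_parquet(path)
    raise ValueError(
        f"Unsupported predictions format: {suffix!r} (expected .csv or .parquet)"
    )
```

Note that with this dispatch, a `.txt` input now fails with the `ValueError` above, which is why the old `test_load_beam_preds_raises_for_non_csv` test had to switch its fixture from `.parquet` to `.txt`.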

- Update class, load(), and _load_beam_preds docstrings for CSV/Parquet
- Fix test_load_beam_preds_raises_for_non_csv: use .txt instead of
  .parquet (which is now a supported format)
- Add test_beam_columns_none_load_predictions_without_beams_parquet
- Update API docs and CLI docs to mention Parquet predictions
- Update instanovo.yaml config comments

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

BioGeek commented Apr 14, 2026

Updated docs, tests, and config for Parquet predictions input support (3b60d1f):

  • Docstrings: Updated class docstring, load(), and _load_beam_preds to mention CSV/Parquet
  • Tests:
    • Fixed test_load_beam_preds_raises_for_non_csv → renamed to test_load_beam_preds_raises_for_unsupported_format, now uses .txt (since .parquet is now supported)
    • Added test_beam_columns_none_load_predictions_without_beams_parquet
  • Docs: Updated datasets.md, cli.md, and instanovo.yaml config comments
