feat: add merged output format and parquet directory support #180

Open: BioGeek wants to merge 6 commits into main from feat-merged-predict-output

Conversation


BioGeek (Contributor) commented Apr 10, 2026

Summary

Add two features needed for INFlow pipeline integration:

1. Merged output format for winnow predict

  • New output_format config option in predict.yaml: "split" (default, legacy) or "merged"
  • When "merged", writes a single calibrated_psms.tsv with all columns (spectrum metadata + predictions + FDR metrics) instead of splitting into two CSVs
  • Legacy split output (metadata.csv + preds_and_fdr_metrics.csv) remains the default — no breaking change
  • Usage: winnow predict output_format=merged

2. Parquet directory support in InstaNovoDatasetLoader

  • _load_spectrum_data() now accepts a directory path containing .parquet files
  • Reads and concatenates all .parquet files in the directory
  • Supports InstaNovo's sharded parquet output (dataset-*-test-*.parquet)
  • Raises ValueError for empty directories
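As a rough illustration of feature 1, the split/merged dispatch could look like the sketch below. The function name, column handling, and file layout are hypothetical stand-ins for whatever winnow actually does internally; only the output file names and the config values come from this PR.

```python
from pathlib import Path

import pandas as pd


def write_predictions(metadata: pd.DataFrame, preds: pd.DataFrame,
                      out_dir: str, output_format: str = "split") -> None:
    """Write calibrated PSM output in either 'split' or 'merged' format.

    Sketch only: `write_predictions` is a hypothetical name, and the real
    winnow code may structure this differently.
    """
    out = Path(out_dir)
    if output_format == "merged":
        # One row per spectrum: metadata columns followed by prediction
        # and FDR-metric columns, written as a single TSV.
        merged = pd.concat(
            [metadata.reset_index(drop=True), preds.reset_index(drop=True)],
            axis=1,
        )
        merged.to_csv(out / "calibrated_psms.tsv", sep="\t", index=False)
    elif output_format == "split":
        # Legacy behaviour: two separate CSVs.
        metadata.to_csv(out / "metadata.csv", index=False)
        preds.to_csv(out / "preds_and_fdr_metrics.csv", index=False)
    else:
        raise ValueError(f"Unknown output_format: {output_format!r}")
```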

Motivation

The INFlow Nextflow pipeline needs:

  • A single canonical calibrated PSM table to pass between processes (merged output)
  • Ability to read InstaNovo's native parquet directory output (sharded parquet files)
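The directory-loading behaviour described for `_load_spectrum_data()` amounts to globbing the shards and concatenating them. A minimal sketch under that assumption (the real method accepts more input formats and does further processing):

```python
from pathlib import Path

import pandas as pd


def load_spectrum_data(path: str) -> pd.DataFrame:
    """Load spectrum data from a single file or a directory of .parquet shards.

    Hedged sketch of the behaviour described above for _load_spectrum_data.
    """
    p = Path(path)
    if p.is_dir():
        # InstaNovo writes sharded output such as dataset-*-test-*.parquet;
        # read every shard and concatenate into one frame.
        shards = sorted(p.glob("*.parquet"))
        if not shards:
            raise ValueError(f"No .parquet files found in directory: {p}")
        return pd.concat((pd.read_parquet(s) for s in shards),
                         ignore_index=True)
    return pd.read_parquet(p)
```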

Test plan

  • 5 new tests in tests/scripts/test_predict_output.py
  • All 299 tests pass (294 existing + 5 new)
  • Pre-commit hooks pass (flake8, ruff, mypy, etc.)

🤖 Generated with Claude Code


github-actions Bot commented Apr 10, 2026

Coverage Report

File                                 Stmts  Miss  Cover  Missing
__init__.py                              0     0   100%
data_types.py                            4     0   100%
calibration/__init__.py                  0     0   100%
calibration/calibration_features.py    316     7    97%  247–248, 445, 734, 922, 926, 1224
calibration/calibrator.py               91    15    83%  69–70, 72, 106–109, 134–135, 137, 162–163, 167, 194–195
compat/__init__.py                       0     0   100%
compat/instanovo.py                     10     6    40%  12, 14–15, 17, 24–25
datasets/__init__.py                     0     0   100%
datasets/calibration_dataset.py        109    17    84%  155, 169, 171, 173, 183, 196, 249, 251–252, 258–261, 263–266
datasets/data_loaders.py               275    14    94%  23, 189, 220–221, 423, 466, 858, 862, 911, 922, 1036–1037, 1073–1074
datasets/interfaces.py                   3     0   100%
datasets/psm_dataset.py                 25     0   100%
fdr/__init__.py                          0     0   100%
fdr/base.py                             58    15    74%  81, 85–86, 91, 98–99, 105, 126, 129–130, 135, 137–138, 144, 186
fdr/database_grounded.py                28     1    96%  52
fdr/nonparametric.py                    25     4    84%  62, 68–69, 72
scripts/__init__.py                      0     0   100%
scripts/main.py                        192   150    21%  17–20, 53, 55–56, 68, 76, 86, 88–90, 92, 94–99, 104–105, 138–140, 161–163, 166, 169, 174, 176–178, 180, 182–183, 186–187, 190, 192–193, 195, 197, 199–200, 202, 205–206, 209–210, 213–214, 217–219, 221, 238–240, 242, 244, 249, 251–253, 255–256, 258–260, 265–266, 268–270, 272, 274, 276–277, 281–284, 286–287, 289–290, 292–293, 295, 312–314, 317, 320, 325, 327–329, 331–333, 335–336, 339–340, 343, 345–346, 348, 350, 352–353, 355, 358–359, 365–366, 369–370, 373–374, 377–378, 386–387, 389, 391, 393, 395, 397–399, 402, 408, 411, 415, 454–455, 492–493, 534–535, 565–566, 593–594, 624–625
utils/__init__.py                        4     0   100%
utils/config_formatter.py               53    40    24%  29, 37–38, 40–42, 44, 55, 58–60, 62–63, 66–69, 72–74, 77–78, 80, 91, 102, 113, 127–128, 130–132, 145–147, 150, 153–154, 157–158, 160
utils/config_path.py                    76     5    93%  24–26, 117–118
utils/peptide.py                        16     0   100%
TOTAL                                 1285   274    78%

Tests  Skipped  Failures  Errors  Time
299    0 💤     0 ❌      0 🔥    36.882s ⏱️

@BioGeek BioGeek requested a review from JemmaLDaniel April 10, 2026 13:59
BioGeek and others added 2 commits April 10, 2026 23:33
winnow.utils was added in 7330c0c but not registered in the packages
list in pyproject.toml, causing ModuleNotFoundError when installed
via pip from the git repo.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
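The pyproject fix above would look roughly like the fragment below, assuming a setuptools-style explicit package list (the project's actual build backend and package names are assumptions here, not taken from the diff):

```toml
# pyproject.toml (sketch)
[tool.setuptools]
packages = [
    "winnow",
    "winnow.calibration",
    "winnow.datasets",
    "winnow.utils",   # previously missing, causing ModuleNotFoundError on pip install
]
```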
When spectrum data contains a ground-truth 'sequence' column with
modification notations not in residue_remapping (e.g. C+57.021),
_process_spectrum_data would crash with ConfigKeyError.

Now catches the remapping error, logs a warning, and drops the
'sequence' column so the pipeline can continue without evaluation
metrics. This is the correct behavior for de novo mode where ground
truth sequences are optional.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

BioGeek commented Apr 10, 2026

Additional fixes added to this PR

3. Include winnow.utils in package distribution (3894cc6)

winnow.utils was added to main but not registered in pyproject.toml packages list, causing ModuleNotFoundError when installed via pip. Also affects main branch.

4. Gracefully handle unmappable residues in ground-truth sequences (c0c9919)

When spectrum data contains a sequence column with modification notations not in residue_remapping (e.g. C+57.021), _process_spectrum_data crashed with ConfigKeyError. Now catches the error, logs a warning, and drops the column so the pipeline continues. Correct for de novo mode.

All 299 tests pass.
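Fix 4 boils down to catching the remapping failure, warning, and dropping the column. The sketch below illustrates that shape with hypothetical names; in particular, the `KeyError` on an unmapped `+`-style modification stands in for the real `ConfigKeyError` raised by the config lookup:

```python
import logging

import pandas as pd

logger = logging.getLogger(__name__)


def remap_sequences(df: pd.DataFrame,
                    residue_remapping: dict[str, str]) -> pd.DataFrame:
    """Apply residue remapping to the ground-truth 'sequence' column.

    Sketch of the fix described above: if a sequence contains a modification
    notation that residue_remapping does not cover (e.g. 'C+57.021'), warn
    and drop the column instead of crashing, so de novo runs without usable
    ground truth can continue.
    """
    if "sequence" not in df.columns:
        return df

    def remap(seq: str) -> str:
        out = seq
        for src, dst in residue_remapping.items():
            out = out.replace(src, dst)
        # Hypothetical strictness check standing in for the real
        # ConfigKeyError raised when a notation is not in the mapping:
        if "+" in out:
            raise KeyError(f"Unmapped modification in sequence: {seq}")
        return out

    try:
        df = df.assign(sequence=df["sequence"].map(remap))
    except KeyError as err:
        logger.warning("Dropping 'sequence' column: %s", err)
        df = df.drop(columns=["sequence"])
    return df
```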

BioGeek and others added 2 commits April 14, 2026 12:08
Adds winnow.__version__ by reading the installed version from
importlib.metadata, keeping it in sync with pyproject.toml.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
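Reading the installed version from `importlib.metadata` is a standard pattern; the commit likely looks something like this sketch (the fallback branch is an assumption, not confirmed by the diff):

```python
# Sketch of winnow/__init__.py version handling via importlib.metadata,
# which reads the version recorded at install time from pyproject.toml.
from importlib.metadata import PackageNotFoundError, version

try:
    __version__ = version("winnow")
except PackageNotFoundError:
    # Package not installed, e.g. running from a source checkout.
    __version__ = "0.0.0+unknown"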
Accept both .csv and .parquet files in _load_predictions_without_beams
and _load_beam_preds, dispatching on file extension. This enables
faster I/O when InstaNovo outputs predictions in Parquet format.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

BioGeek commented Apr 14, 2026

Added Parquet input support for predictions (1174cd9).

Both _load_predictions_without_beams and _load_beam_preds now accept .parquet files in addition to .csv, dispatching on file extension.

Motivation: The INFlow pipeline passes predictions from InstaNovo to Winnow. With the corresponding InstaNovo change (instadeepai/InstaNovo-internal#574), predictions can be saved as Parquet for faster I/O. This change lets Winnow read them without conversion.

Also previously pushed: winnow.__version__ support via importlib.metadata (9d898b2).

No breaking changes — existing CSV workflows are unaffected.
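The extension dispatch described here is straightforward; a hedged sketch (the real `_load_predictions_without_beams` and `_load_beam_preds` do further column handling after loading):

```python
from pathlib import Path

import pandas as pd


def read_predictions(path: str) -> pd.DataFrame:
    """Load a predictions table, dispatching on file extension.

    Sketch of the CSV/Parquet dispatch added in this PR; `read_predictions`
    is a hypothetical stand-in name.
    """
    suffix = Path(path).suffix.lower()
    if suffix == ".csv":
        return pd.read_csv(path)
    if suffix == ".parquet":
        return pd.read_parquet(path)
    raise ValueError(
        f"Unsupported predictions format: {suffix!r} (expected .csv or .parquet)"
    )
```

Note that with this dispatch, a `.txt` input now fails with the `ValueError` above, which is why the old `test_load_beam_preds_raises_for_non_csv` test had to switch its fixture from `.parquet` to `.txt`.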

- Update class, load(), and _load_beam_preds docstrings for CSV/Parquet
- Fix test_load_beam_preds_raises_for_non_csv: use .txt instead of
  .parquet (which is now a supported format)
- Add test_beam_columns_none_load_predictions_without_beams_parquet
- Update API docs and CLI docs to mention Parquet predictions
- Update instanovo.yaml config comments

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

BioGeek commented Apr 14, 2026

Updated docs, tests, and config for Parquet predictions input support (3b60d1f):

  • Docstrings: Updated class docstring, load(), and _load_beam_preds to mention CSV/Parquet
  • Tests:
    • Fixed test_load_beam_preds_raises_for_non_csv → renamed to test_load_beam_preds_raises_for_unsupported_format, now uses .txt (since .parquet is now supported)
    • Added test_beam_columns_none_load_predictions_without_beams_parquet
  • Docs: Updated datasets.md, cli.md, and instanovo.yaml config comments
