feat: add merged output format and parquet directory support #180
Add two features needed for INFlow pipeline integration:
1. Merged output format for `winnow predict`:
- New `output_format` config option: "split" (default, legacy) or "merged"
- When "merged", writes a single calibrated_psms.tsv with all columns
(spectrum metadata + predictions + FDR metrics) instead of two CSVs
- Legacy split output (metadata.csv + preds_and_fdr_metrics.csv) remains default
2. Parquet directory support in InstaNovoDatasetLoader:
- _load_spectrum_data() now accepts a directory path
- Reads and concatenates all .parquet files in the directory
- Supports InstaNovo's sharded parquet output (dataset-*-test-*.parquet)
- Raises ValueError for empty directories
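The directory handling described in item 2 could look roughly like the sketch below. It is a minimal illustration, not the actual `_load_spectrum_data()` code: the helper name `resolve_parquet_shards` is hypothetical, and it only decides which files would be read, leaving the read and concatenation to whichever dataframe library the loader uses.

```python
from pathlib import Path


def resolve_parquet_shards(path: Path) -> list[Path]:
    """Return the Parquet files to read for a file or directory path.

    Hypothetical helper for illustration; not part of winnow's API.
    """
    if path.is_dir():
        # Sort so shards are always concatenated in a deterministic order.
        shards = sorted(path.glob("*.parquet"))
        if not shards:
            raise ValueError(f"No .parquet files found in directory: {path}")
        return shards
    # A single file is treated as a one-element shard list.
    return [path]
```

Sorting the shard list is an assumption made here so that repeated runs concatenate rows in the same order.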
Tests: 5 new tests covering split output, merged output, directory loading,
and empty directory error handling. All 299 tests pass.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
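For reviewers, the difference between the two `output_format` values can be sketched with the standard library alone. This is an illustration under assumptions, not the code in this PR: `write_outputs`, the `metadata_cols` argument, and the example column names are hypothetical, and the real split files may share key columns that this sketch does not duplicate.

```python
import csv
from pathlib import Path


def write_outputs(rows, metadata_cols, out_dir, output_format="split"):
    """Write rows (a non-empty list of dicts with identical keys).

    Hypothetical helper for illustration; not winnow's implementation.
    """
    if output_format not in ("split", "merged"):
        raise ValueError(f"Unknown output_format: {output_format!r}")
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    all_cols = list(rows[0].keys())

    def _write(path, cols, delimiter):
        with open(path, "w", newline="") as fh:
            # extrasaction="ignore" lets one row list feed files with fewer columns.
            writer = csv.DictWriter(
                fh, fieldnames=cols, delimiter=delimiter, extrasaction="ignore"
            )
            writer.writeheader()
            writer.writerows(rows)

    if output_format == "merged":
        # One tab-separated file holding every column.
        _write(out_dir / "calibrated_psms.tsv", all_cols, "\t")
    else:
        # Legacy layout: metadata columns in one CSV, everything else in another.
        meta = [c for c in all_cols if c in metadata_cols]
        rest = [c for c in all_cols if c not in metadata_cols]
        _write(out_dir / "metadata.csv", meta, ",")
        _write(out_dir / "preds_and_fdr_metrics.csv", rest, ",")
```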
winnow.utils was added in 7330c0c but not registered in the packages list in pyproject.toml, causing ModuleNotFoundError when installed via pip from the git repo.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
When spectrum data contains a ground-truth 'sequence' column with modification notations not in residue_remapping (e.g. C+57.021), _process_spectrum_data would crash with ConfigKeyError. Now catches the remapping error, logs a warning, and drops the 'sequence' column so the pipeline can continue without evaluation metrics. This is the correct behavior for de novo mode where ground truth sequences are optional.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
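The catch, warn, and drop behaviour described in this commit follows a pattern like the one sketched here. The sketch makes several assumptions: columns are modelled as a plain dict of lists rather than a dataframe, `remap` is a caller-supplied function, `KeyError` stands in for the `ConfigKeyError` named above, and `remap_or_drop_sequence` is a hypothetical name.

```python
import logging

logger = logging.getLogger(__name__)


def remap_or_drop_sequence(columns, remap):
    """Remap the optional 'sequence' column, dropping it if remapping fails.

    Hypothetical helper for illustration; `columns` maps names to lists.
    """
    if "sequence" not in columns:
        return dict(columns)
    try:
        remapped = [remap(seq) for seq in columns["sequence"]]
    except KeyError as err:  # stand-in for the remapping error
        logger.warning(
            "Could not remap ground-truth sequences (%s); dropping 'sequence' "
            "column, evaluation metrics will be unavailable.",
            err,
        )
        return {name: vals for name, vals in columns.items() if name != "sequence"}
    return {**columns, "sequence": remapped}
```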
Additional fixes added to this PR3. Include
Adds winnow.__version__ by reading the installed version from importlib.metadata, keeping it in sync with pyproject.toml.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
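Reading a version through `importlib.metadata` typically looks like the sketch below. The function name `installed_version` and the `"unknown"` fallback for a package that is not installed are assumptions added for illustration; the commit does not say how that case is handled.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(dist_name: str, fallback: str = "unknown") -> str:
    """Return the installed distribution's version, or `fallback` if absent."""
    try:
        return version(dist_name)
    except PackageNotFoundError:
        # Happens when running from a source checkout that was never installed.
        return fallback
```

A package's `__init__.py` could then set `__version__ = installed_version("winnow")`, assuming the distribution is published under that name.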
Accept both .csv and .parquet files in _load_predictions_without_beams and _load_beam_preds, dispatching on file extension. This enables faster I/O when InstaNovo outputs predictions in Parquet format.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
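Dispatching on file extension can be sketched as follows. This only shows the selection step: `predictions_format` is a hypothetical helper, and both the `ValueError` and the case-insensitive suffix match are assumptions rather than confirmed behaviour of the loader. The real methods would go on to call a CSV or Parquet reader.

```python
from pathlib import Path

# Extensions the sketch treats as supported prediction formats.
_SUPPORTED_SUFFIXES = {".csv": "csv", ".parquet": "parquet"}


def predictions_format(path) -> str:
    """Return 'csv' or 'parquet' based on the file extension."""
    suffix = Path(path).suffix.lower()
    if suffix not in _SUPPORTED_SUFFIXES:
        raise ValueError(
            f"Unsupported predictions file {str(path)!r}: expected .csv or .parquet"
        )
    return _SUPPORTED_SUFFIXES[suffix]
```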
Added Parquet input support for predictions ( Both
Motivation: The INFlow pipeline passes predictions from InstaNovo to Winnow. With the corresponding InstaNovo change (instadeepai/InstaNovo-internal#574), predictions can be saved as Parquet for faster I/O. This change lets Winnow read them without conversion.
Also previously pushed:
No breaking changes: existing CSV workflows are unaffected.
- Update class, load(), and _load_beam_preds docstrings for CSV/Parquet
- Fix test_load_beam_preds_raises_for_non_csv: use .txt instead of .parquet (which is now a supported format)
- Add test_beam_columns_none_load_predictions_without_beams_parquet
- Update API docs and CLI docs to mention Parquet predictions
- Update instanovo.yaml config comments
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Updated docs, tests, and config for Parquet predictions input support (
Summary
Add two features needed for INFlow pipeline integration:
1. Merged output format for `winnow predict`
- New `output_format` config option in `predict.yaml`: `"split"` (default, legacy) or `"merged"`
- When `"merged"`, writes a single `calibrated_psms.tsv` with all columns (spectrum metadata + predictions + FDR metrics) instead of splitting into two CSVs
- Legacy split output (`metadata.csv` + `preds_and_fdr_metrics.csv`) remains the default, so there is no breaking change
- `winnow predict output_format=merged`
2. Parquet directory support in `InstaNovoDatasetLoader`
- `_load_spectrum_data()` now accepts a directory path containing `.parquet` files
- Reads and concatenates all `.parquet` files in the directory
- Supports InstaNovo's sharded parquet output (`dataset-*-test-*.parquet`)
- Raises `ValueError` for empty directories
Motivation
The INFlow Nextflow pipeline needs:
Test plan
`tests/scripts/test_predict_output.py`
🤖 Generated with Claude Code