Add msproteomics test datasets#1952

Closed
an-altosian wants to merge 1 commit into nf-core:master from an-altosian:msproteomics

@an-altosian

Summary

Public test datasets for the nf-core/msproteomics pipeline.

  • TMT: PRIDE PXD000001 (Erwinia carotovora, TMT6) — 2 mzML subsets
  • DDA LFQ: Zenodo 1051552 (Human SILAC) — 2 mzML subsets
  • DIA: CPTAC CCRCC (Human DIA) — 1 mzML subset
  • FASTA: UniProt reference databases (Erwinia, Human SwissProt 2000-protein subset, E. coli+UPS1)
  • Module inputs: Pre-computed intermediate files for unit testing individual modules
  • Samplesheets: CSV inputs for all workflow test profiles

FASTA file sizes

| File | Size | Proteins |
| --- | --- | --- |
| ecoli_ups1_test.fasta | 1.8 MB | |
| erwinia_carotovora.fasta | 1.6 MB | |
| erwinia_uniprot.fasta | 1.9 MB | |
| human_sp_subset.fasta | 1.4 MB | 2,000 |
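
The protein count for a FASTA file can be checked by counting header lines. The snippet below is a minimal sketch of that check, not part of this PR; the function name `count_fasta_entries` is hypothetical, and it assumes every record starts with a `>` header line.

```python
def count_fasta_entries(text: str) -> int:
    """Count FASTA records by counting lines that start with '>'."""
    return sum(1 for line in text.splitlines() if line.startswith(">"))


if __name__ == "__main__":
    # Tiny in-memory example (made-up sequences, not from the test datasets).
    example = ">sp|P0|A_TEST\nMKV\n>sp|P1|B_TEST\nMAA\nLLK\n"
    print(count_fasta_entries(example))
```

To apply it to a real file, read the file contents first, for example with `pathlib.Path(path).read_text()`.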

human_sp_subset.fasta is a targeted subset of 2,000 proteins: the 169 proteins identified from the DDA LFQ test spectra (HEK SILAC), plus 1,831 evenly spaced SwissProt entries for search-space diversity. It was validated by running the full FragPipe DDA LFQ pipeline end to end (174 proteins identified at 1% FDR).
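
The subsetting idea described above (keep every identified protein, then fill the remainder with evenly spaced entries from the rest of the database) could look roughly like the following. This is an illustrative sketch only, not the contents of generate_test_subsets.sh: the function names `read_fasta`, `accession`, and `build_subset` are hypothetical, the UniProt-style `sp|ACCESSION|NAME` header layout is assumed, and "evenly spaced" is interpreted here as evenly spaced positions in file order.

```python
def read_fasta(text: str) -> list[tuple[str, str]]:
    """Parse FASTA text into (header, sequence) records."""
    records: list[tuple[str, str]] = []
    header, seq = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(seq)))
            header, seq = line[1:], []
        elif header is not None:
            seq.append(line)
    if header is not None:
        records.append((header, "".join(seq)))
    return records


def accession(header: str) -> str:
    """Extract the accession, assuming 'sp|P12345|NAME_HUMAN ...' headers."""
    parts = header.split("|")
    return parts[1] if len(parts) >= 3 else header.split()[0]


def build_subset(records, identified: set[str], total: int):
    """Keep all identified proteins, then add evenly spaced fillers up to `total`."""
    keep = [r for r in records if accession(r[0]) in identified]
    rest = [r for r in records if accession(r[0]) not in identified]
    n_fill = max(0, min(total - len(keep), len(rest)))
    # Integer stepping gives n_fill distinct, evenly spread indices into `rest`.
    fillers = [rest[(i * len(rest)) // n_fill] for i in range(n_fill)]
    return keep + fillers


if __name__ == "__main__":
    # Made-up ten-entry database; P3 and P7 play the role of identified proteins.
    db = "".join(f">sp|P{i}|X{i}_TEST demo\nMKV\n" for i in range(10))
    subset = build_subset(read_fasta(db), {"P3", "P7"}, total=5)
    print([accession(h) for h, _ in subset])
```

With the PR's numbers, `identified` would hold the 169 accessions and `total` would be 2000, leaving 1,831 filler entries.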

Supersedes #1946 (closed due to force-push history issue).

🤖 Generated with Claude Code

Public datasets for nf-core/msproteomics pipeline stub and integration testing:
- TMT: PRIDE PXD000001 (Erwinia carotovora, TMT6) - 2 mzML subsets
- DDA LFQ: Zenodo 1051552 (Human SILAC) - 2 mzML subsets
- DIA: CPTAC CCRCC (Human DIA) - 1 mzML subset
- FASTA: UniProt reference databases (Erwinia, Human SwissProt subset, E. coli+UPS1)
- Module inputs: pre-computed intermediate files for unit testing individual modules
- Samplesheets: CSV inputs for all workflow test profiles
- Script: generate_test_subsets.sh for reproducible subset generation

human_sp_subset.fasta contains 2000 proteins: 169 identified from DDA LFQ
test spectra (HEK SILAC) + 1831 evenly-spaced entries for search space diversity.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@an-altosian

Closing - wrong base branch. Will recreate targeting msproteomics branch.
