Skip to content

Standardized multiple inputs#156

Merged
joverlee521 merged 8 commits intomasterfrom
standardized-multiple-inputs
Oct 28, 2025
Merged

Standardized multiple inputs#156
joverlee521 merged 8 commits intomasterfrom
standardized-multiple-inputs

Conversation

@joverlee521
Copy link
Copy Markdown
Contributor

@joverlee521 joverlee521 commented Sep 26, 2025

Description of proposed changes

Use latest guidance for handling inputs based on nextstrain/pathogen-repo-guide#91.

  • use shared remote_inputs.smk, which bumps minimum Snakemake to 8.0.0
  • use augur merge for sequences
  • use augur read-file or augur merge depending on number of inputs (see f5c9949)

The config parameters remain unchanged so this should not affect outside users. Downstream rules are now expected to use "results/metadata.tsv" and "results/sequences_{segment}.fasta" as inputs instead of the input_* functions.

Related issue(s)

nextstrain/public#25

Checklist

  • Checks pass

Post-merge:

subrepo:
  subdir:   "shared/vendored"
  merged:   "43e5a6f"
upstream:
  origin:   "https://github.com/nextstrain/shared"
  branch:   "main"
  commit:   "43e5a6f"
git-subrepo:
  version:  "0.4.6"
  origin:   "https://github.com/ingydotnet/git-subrepo"
  commit:   "110b9eb"
Removes the need to custom rules to download from AWS S3 and adds support for 
https, but requires Snakemake >=8.0.0.
@joverlee521
Copy link
Copy Markdown
Contributor Author

Will merge this by EOD tomorrow if there are no comments.

Comment thread .github/workflows/ci.yaml Outdated
Comment on lines +14 to +17
build-args: |
--snakefile segment-focused/Snakefile \
-pf test_target \
--config 'inputs=[{"name": "example", "metadata": "example_data/gisaid/metadata.tsv", "sequences": "example_data/gisaid/sequences_{segment}.fasta"}]'
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this also be updated in augur/docker-base/conda-base?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah you're right! This change is necessary because the merged inputs have changed from data/metadata.tsv to results/metadata.tsv so the automatic copying of example data in pathogen-repo-ci.yaml@v0 no longer works.

Hm, I'll take this opportunity to make it a separate config file to make future changes easier.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, moving this into a separate config file has made it apparent that the local inputs will need to be wrapped in resolve_config_path somewhere.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, moving this into a separate config file has made it apparent that the local inputs will need to be wrapped in resolve_config_path somewhere.

For running within the repo or from external analysis directories? I'm assuming the latter, as this has come up a bit both for local ingest results and in the context of storing USVI Zika data within the repo. I don't think we need to support this. (If it's not working running within the repo, can you say more?)

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added separate CI build in 426f8df.


For running within the repo or from external analysis directories?

From within the repo, it only works if you run the CI workflow from the top level directory, e.g. nextstrain build . -s build-configs/ci/Snakefile is fine. However, running from within the nested directory does not work:

$ cd build-configs/ci
$ nextstrain build .
Assuming unrestricted shared filesystem usage.
host: 457910275e93
Building DAG of jobs...
MissingInputException in rule merge_sequences in file "/nextstrain/build/rules/merge_inputs.smk", line 126:
Missing input files for rule merge_sequences:
    output: results/sequences_ha.fasta
    wildcards: segment=ha
    affected files:
        example_data/gisaid/sequences_ha.fasta

This definitely does not work for external analysis directories.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I'm in agreement here that we probably don't need to support running the CI build from external directories.

Since this is no longer needed, I've removed the additional Snakefile in the latest force-push and just left the config.yaml which can be used in the CI.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also realized I never followed up on this, I'll make the corresponding changes in those 3 repos.

Downstream PRs for updating CI workflows:

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this does make the inputs/additional_inputs params different from other config params that expect file paths: they must always be relative to the working directory.

That's a subtlety of path_or_url, not inputs/additional_inputs per-se. We could add a path-resolver argument to the function and/or an explicit check that the requested local file path exists. (The reason we don't do this is that we can use the function to create / write files, and perhaps some other complexity that's slipping my mind currently.) Alternatively, this goal could be achieved by supplying search-paths to the consuming augur command.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Alternatively, this goal could be achieved by supplying search-paths to the consuming augur command.

nextstrain/augur#1897

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @jameshadfield, I'm continuing the idea of adding path resolver to path_or_url in nextstrain/shared#65. That can be tackled in future changes and pulled into this repo later.

I think this is different than nextstrain/augur#1897 since the inputs are threaded through Snakemake inputs so they have to be resolved as relative paths to the working dir.

Comment thread rules/merge_inputs.smk Outdated
Modified from the dynamic rules in the pathogen-repo-guide.¹ 

The dynamic rules did not work in this repo because it allows different number 
of inputs per segment wildcard. There's no good way to define different rules 
per wildcard so I've opted to use if/else blocks in the shell to run either 
`augur read-file` or `augur merge` depending on the number of inputs. 

¹ <https://github.com/nextstrain/pathogen-repo-guide/blob/cb87e1ac0e6c98d08bab76b954309aeb3bae889b/phylogenetic/rules/merge_inputs.smk>
@jameshadfield
Copy link
Copy Markdown
Member

The config parameters remain unchanged so this should not affect outside users.

[this PR] bumps minimum Snakemake to 8.0.0

Can you touch base with Louise's group re: snakemake v8 (if you haven't already!)

@joverlee521 joverlee521 force-pushed the standardized-multiple-inputs branch from d70d789 to 426f8df Compare October 7, 2025 20:28
Copy link
Copy Markdown
Member

@jameshadfield jameshadfield left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for moving this repo forward Jover! I haven't tested but I can do so if you'd like? (after fixing Victor's comments)

Comment thread rules/merge_inputs.smk
Comment thread rules/merge_inputs.smk
@joverlee521 joverlee521 force-pushed the standardized-multiple-inputs branch from 426f8df to fc4a6f4 Compare October 7, 2025 21:43
The main motivation for this is to simplify the CI build-args which will 
need to be replicated across augur, docker-base, and conda-base. 

Note the CI config uses the repo's local example_data as inputs and the paths 
are relative to the top level directory, so the workflow can only be run from 
the top level directory.
@joverlee521 joverlee521 force-pushed the standardized-multiple-inputs branch from fc4a6f4 to 154f3c9 Compare October 28, 2025 20:07
joverlee521 added a commit to nextstrain/augur that referenced this pull request Oct 28, 2025
Matching changes in ci builds-args in 
nextstrain/avian-flu#156
joverlee521 added a commit to nextstrain/docker-base that referenced this pull request Oct 28, 2025
Matching changes in ci builds-args in 
nextstrain/avian-flu#156
joverlee521 added a commit to nextstrain/conda-base that referenced this pull request Oct 28, 2025
Matching changes in ci builds-args in 
nextstrain/avian-flu#156
@joverlee521
Copy link
Copy Markdown
Contributor Author

Can you touch base with Louise's group re: snakemake v8 (if you haven't already!)

Ah, realized I never followed up here. I chatted with Louise about the change on Slack and it should be fine with their group because they are using the Nextstrain Docker image.

@joverlee521
Copy link
Copy Markdown
Contributor Author

I believe I've addressed all open comments in this PR. I'll merge this by EOD today if there are no other comments.

@joverlee521 joverlee521 merged commit f963447 into master Oct 28, 2025
6 checks passed
@joverlee521 joverlee521 deleted the standardized-multiple-inputs branch October 28, 2025 23:32
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants