This repository provides source code for several MACSE-based pipelines dedicated to the alignment of nucleotide coding sequences. These pipelines are mostly bash scripts encapsulated within Singularity containers and sometimes combined into Nextflow workflows.
- alfix: this pipeline uses MACSE and HmmCleaner to produce a high-quality alignment of nucleotide (NT) coding sequences using their amino acid (AA) translations. It is well suited for datasets containing a few dozen sequences of a few kb each.
- OMM_MACSE: this pipeline also produces a codon-aware alignment thanks to MACSE, which can optionally be filtered by HmmCleaner, but it can handle larger datasets by relying on MAFFT, MUSCLE or PRANK to scale up.
These two pipelines are described in our MACSE tutorial paper [Ranwez et al. 2020].
- macse_barcode: this Nextflow pipeline aligns hundreds of thousands of barcoding sequences.
- build_ref_align: this Nextflow pipeline identifies a small subset of sequences that are representative of the diversity of the barcoding input sequence dataset.
- enrich_align: this Nextflow pipeline aligns barcoding sequences based on a reference alignment.
- representative_sequences: a bash script that identifies a small subset of sequences that are representative of the diversity of the barcoding input sequence dataset; it is chained with OMM_MACSE in the build_ref_align workflow.
These pipelines are detailed in our book chapter dedicated to MACSE and barcoding datasets [Delsuc & Ranwez, 2020]. While using macse_barcode is the easiest solution, chaining build_ref_align and enrich_align lets you check the quality of the proposed reference alignment and manually curate it, if needed, before using it to align the barcode sequences.
We used the macse_barcode pipeline to align COI, matK and rbcL sequences for numerous taxonomic groups. All resulting alignments are available here and on Zenodo (thanks to Roderic Page).
MACSE: Multiple Alignment of Coding SEquences Accounting for Frameshifts and Stop Codons.
A wide range of molecular analyses rely on multiple sequence alignments (MSA). Until now, the most efficient solution to align nucleotide (NT) sequences containing open reading frames was to use indirect procedures that align amino acid (AA) translations before reporting the inferred gap positions at the codon level. There are two important pitfalls with this approach. Firstly, any premature stop codon prevents the use of such a strategy. Secondly, each sequence is translated with the same reading frame from beginning to end, so that the presence of a single additional nucleotide leads to both an aberrant translation and an aberrant alignment.
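The second pitfall can be illustrated with a minimal shell sketch. The helper name `codons` and the two toy sequences are made up for this illustration and are not part of MACSE; the helper only splits a sequence into consecutive triplets, always starting from the first nucleotide.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of MACSE): split a nucleotide sequence into
# consecutive triplets, always reading from the first nucleotide (frame 1).
codons() {
    printf '%s\n' "$1" | fold -w 3 | paste -sd ' ' -
}

original="ATGGCCAAGTTT"     # toy coding sequence: ATG GCC AAG TTT
shifted="ATGGGCCAAGTTT"     # same sequence with one extra G inserted after ATG

codons "$original"
codons "$shifted"
```

In the second output, every triplet after the insertion differs from the original one, so the downstream translation no longer resembles the original protein. This is the situation in which a translate-then-align procedure breaks down.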
MACSE [Ranwez et al., 2011] aligns coding NT sequences with respect to their AA translation while allowing NT sequences to contain multiple frameshifts and/or stop codons. MACSE is hence the first automatic solution to align protein-coding gene datasets containing non-functional sequences (pseudogenes) without disrupting the underlying codon structure. It has also proved useful in detecting undocumented frameshifts in public database sequences and in aligning next-generation sequencing reads/contigs against a reference coding sequence.
For further details about the underlying algorithm see the original publication: MACSE: Multiple Alignment of Coding SEquences accounting for frameshifts and stop codons. Vincent Ranwez, Sébastien Harispe, Frédéric Delsuc, Emmanuel JP Douzery PLoS One 2011, 6(9): e22594.
More information (including documentation and tutorials) is available on the MACSE website.
A Singularity container [Kurtzer, 2017] contains everything that is needed to execute a specific task. The person building the container handles dependencies and environment configuration so that the end user does not need to. The file specifying the construction of the container is a simple text file called a recipe (we provide the recipes of our containers as well as the containers themselves). As our scripts/pipelines often rely on several other scripts and external tools (e.g. MAFFT), Singularity containers are very handy: the end user just needs to install Singularity and download the container, without having to install dependencies or set environment variables.
A brief introduction to Singularity is available here. If you get an error message stating that your input file does not exist, the folder containing it is probably not visible from within the Singularity container. A solution found by one user is to set the SINGULARITY_BINDPATH variable:
export SINGULARITY_BINDPATH="/path/to/fasta"
Note: Singularity is now called Apptainer. OMM_MACSE v12.02 and later have been built and tested with Apptainer on several HPC clusters.
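When input files live in several folders, the bind path can be built from the files themselves. The following is a minimal sketch, not part of the provided pipelines: the function name `bind_dirs_for` and the example paths are hypothetical. It relies on the fact that the bind path variable accepts a comma-separated list of directories.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of the pipelines): print the distinct parent
# directories of the given files as a comma-separated list, which is the
# format accepted by SINGULARITY_BINDPATH.
bind_dirs_for() {
    local f dir out=""
    for f in "$@"; do
        # Resolve the parent directory to an absolute path; fail if it is missing.
        dir=$(cd "$(dirname "$f")" && pwd) || return 1
        case ",$out," in
            *",$dir,"*) ;;                     # directory already listed
            *) out="${out:+$out,}$dir" ;;      # append with a comma separator
        esac
    done
    printf '%s\n' "$out"
}

# Example usage (placeholder paths):
#   export SINGULARITY_BINDPATH="$(bind_dirs_for /path/to/fasta/seqs.fasta /other/dir/ref.fasta)"
#   export APPTAINER_BINDPATH="$SINGULARITY_BINDPATH"   # variable name used by Apptainer
```

The same list can alternatively be passed on the command line with the `--bind` option of `singularity` or `apptainer`.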
Nextflow [Di Tommaso, 2017] enables scalable and reproducible scientific workflows using software containers, allowing the adaptation of pipelines written in the most common scripting languages.
Nextflow separates the workflow itself from the directives describing how to execute it in a given environment. One key advantage of Nextflow is that, by slightly changing the “nextflow.config” file, the same workflow can be parallelized and launched to exploit the full resources of a high performance computing (HPC) cluster.
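As an illustration of this separation, the sketch below writes a small, separate configuration file that asks Nextflow to submit tasks to a cluster scheduler. It is only an example under stated assumptions: the file name `cluster.config`, the SLURM executor, the queue name and the CPU count are all placeholders, and the “nextflow.config” shipped with the workflows may be organized differently.

```shell
#!/usr/bin/env bash
# Write a hypothetical cluster configuration for Nextflow.
# Executor, queue name and CPU count are placeholders to adapt to your cluster.
config_file="${1:-cluster.config}"

cat > "$config_file" <<'EOF'
process {
    executor = 'slurm'   // scheduler used to submit each task (assumed: SLURM)
    queue    = 'normal'  // hypothetical queue/partition name
    cpus     = 4         // CPUs requested per task
}
singularity {
    enabled = true       // run tasks inside Singularity containers
}
EOF

echo "Wrote $config_file"
# The workflow could then be launched with this extra configuration, e.g.:
#   nextflow run <workflow>.nf -c cluster.config
```

The workflow script itself is untouched; only the configuration passed with `-c` changes between a laptop run and a cluster run.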
- Seaview - A great Multiplatform GUI for molecular phylogeny.
- SeqTUI - A fast terminal-based viewer and command-line toolkit for sequences. View, translate, convert (to FASTA), and combine sequences aligned or not — all from the terminal.
- Nextflow enables reproducible computational workflows. Di Tommaso, P., Chatzou, M., Floden, E. W., Barja, P. P., Palumbo, E., and Notredame, C. (2017). Nature Biotechnology, 35(4):316–319. Nextflow web site
- Singularity: Scientific containers for mobility of compute. Kurtzer, G. M., Sochat, V., and Bauer, M. W. (2017). PLoS One, 12(5):e0177459. Singularity web site
- MACSE: Multiple Alignment of Coding SEquences accounting for frameshifts and stop codons. V. Ranwez, S. Harispe, F. Delsuc, E. JP Douzery. PLoS One 2011, 6(9): e22594. MACSE web site
- MACSE v2: Toolkit for the Alignment of Coding Sequences Accounting for Frameshifts and Stop Codons. V. Ranwez, E. JP Douzery, C. Cambon, N. Chantret, F. Delsuc. Molecular Biology and Evolution 2018, 35(10):2582–2584. doi:10.1093/molbev/msy159.
- Aligning Protein-Coding Nucleotide Sequences with MACSE. Ranwez V, Chantret N, Delsuc F. Methods Mol Biol. 2021;2231:51–70. doi:10.1007/978-1-0716-1036-7_4. PMID: 33289886.
- Accurate alignment of (meta)barcoding datasets using MACSE. F. Delsuc and V. Ranwez (2020). In Scornavacca, C., Delsuc, F., and Galtier, N., editors, Phylogenetics in the Genomic Era, chapter No. 2.3, pp. 2.3:1–2.3:30, https://hal.inria.fr/PGE.