- About Nerpa
  1.1. Nerpa pipeline
  1.2. Supported data types
- Installation
  2.1. Prerequisites
  2.2. Installation from tarball
  2.3. Verifying your installation
- Running Nerpa
  3.1. Quick start
  3.2. Command-line options
  3.3. Output files
- Citation
- Feedback and bug reports
Nerpa is a tool for linking biosynthetic gene clusters (BGCs) to known nonribosomal peptide (NRP) structures. You can read more about the Nerpa algorithm and the practical applications of the tool in our papers. Nerpa is currently developed and maintained by Gurevich Lab at the Helmholtz Institute for Pharmaceutical Research Saarland (HIPS) and the Center for Bioinformatics Saar (CBI).
This manual will help you install and run the tool. Nerpa version 2.1.0 was released on 25.03.2026. The tool is dual-licensed and is available under either GPLv3 or Creative Commons BY-NC-SA 4.0; see LICENSE.txt.
The simplified Nerpa pipeline is depicted in the figure below.
Nerpa takes as input an NRP structure database and genome sequences. The pipeline goes as follows:
- Construct tentative NRP synthetase assembly lines along with respective sequences of genome-predicted residues (using antiSMASH for BGC annotation and PARAS for A domain specificity prediction).
- Construct representations of the database structures as monomer graphs (using rBAN).
- Build HMMs for genome-predicted NRP synthetase assembly lines as described in the Nerpa 2 paper.
- Extract NRP linearizations from the monomer graphs.
- Score the NRP linearizations against the HMMs in an all-vs-all manner (using the Viterbi algorithm).
- Create an interactive report with the best matches and detailed alignments.
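For readers unfamiliar with the Viterbi algorithm mentioned in the scoring step, here is a minimal, generic sketch in log space. It finds the most probable sequence of hidden states for a sequence of observed symbols. The two states, the symbols, and all probabilities are invented for illustration and are not Nerpa's actual HMM, which is described in the Nerpa 2 paper.

```python
import math

def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return (best_log_prob, best_path) for a discrete HMM.

    Works in log space; assumes every probability used is nonzero.
    """
    # Best score of a path ending in each state after the first symbol.
    scores = {s: math.log(start_p[s]) + math.log(emit_p[s][observations[0]])
              for s in states}
    backpointers = []
    for obs in observations[1:]:
        new_scores, pointers = {}, {}
        for s in states:
            # Predecessor that maximizes the score of a path entering s.
            prev = max(states, key=lambda p: scores[p] + math.log(trans_p[p][s]))
            new_scores[s] = (scores[prev] + math.log(trans_p[prev][s])
                             + math.log(emit_p[s][obs]))
            pointers[s] = prev
        scores = new_scores
        backpointers.append(pointers)
    # Trace the best path back from the best final state.
    last = max(states, key=lambda s: scores[s])
    path = [last]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    path.reverse()
    return scores[last], path

# Toy model: every name and number below is hypothetical.
states = ("S1", "S2")
start_p = {"S1": 0.6, "S2": 0.4}
trans_p = {"S1": {"S1": 0.7, "S2": 0.3}, "S2": {"S1": 0.4, "S2": 0.6}}
emit_p = {"S1": {"x": 0.5, "y": 0.4, "z": 0.1},
          "S2": {"x": 0.1, "y": 0.3, "z": 0.6}}

log_prob, path = viterbi(["x", "y", "z"], states, start_p, trans_p, emit_p)
print(path, math.exp(log_prob))
```

In this toy model the best path for the symbols x, y, z is S1, S1, S2.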
For genome sequences:
- Recommended: complete antiSMASH output after processing your raw genome sequence (e.g., downloaded from the antiSMASH web server); or antiSMASH job IDs (in this case, Nerpa will download the results automatically).
- Also accepted: raw genome sequences in the FASTA and GenBank formats; in this case, Nerpa will predict NRP BGCs in them with antiSMASH (should be installed separately and present in PATH or provided to Nerpa via --antismash-installation-dir).
For NRP structures:
- Recommended: isomeric SMILES format; Nerpa distinguishes between L- and D-configurations of amino acids, so the use of the isomeric format leads to more accurate results.
- Also accepted: any other SMILES, i.e., without stereochemistry information.
Note: you can use free online converters to get (isomeric) SMILES from other popular chemical formats such as MDL MOL or InChI, e.g., this one from UNM. Alternatively, there are many command-line converters, e.g., molconvert, or programming libraries, e.g., RDKit.
- (Required) Nerpa relies on Java (to run the embedded rBAN), Python v3.10 or higher, and a number of Python dependencies specified in the environment.yml file.
  We highly recommend installing Conda to easily set up all dependencies, as demonstrated below.
- (Required) Nerpa's core scoring algorithm is implemented in C++. Please install a C++20 compiler and CMake v3.10 or higher.
- (Optional) If you plan to use Nerpa with raw genome sequences (FASTA or GenBank) rather than antiSMASH-processed files, you will also need to install antiSMASH locally.
  Alternatively, you can use the antiSMASH web server.
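As a quick sanity check before installing, the short script below reports whether some of the prerequisites listed above are visible on your system. It is only a sketch: the executable names java and cmake are assumptions about how the tools are called on your machine, and it does not check the C++20 compiler, since compiler names vary between systems.

```python
import shutil
import sys

# Executable names are assumptions; adjust them to your system.
REQUIRED_TOOLS = ("java", "cmake")

def check_prerequisites(tools=REQUIRED_TOOLS, min_python=(3, 10)):
    """Return a dict mapping each requirement to True (found) or False."""
    status = {tool: shutil.which(tool) is not None for tool in tools}
    key = f"python>={min_python[0]}.{min_python[1]}"
    status[key] = sys.version_info[:2] >= min_python
    return status

for name, ok in check_prerequisites().items():
    print(f"{name}: {'found' if ok else 'MISSING'}")
```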
First, download and unpack the release tarball:
curl -L -o nerpa-2.1.0.tar.gz https://github.com/gurevichlab/nerpa/releases/download/nerpa_2.1.0/nerpa-2.1.0.tar.gz
tar -xzf nerpa-2.1.0.tar.gz
cd nerpa-2.1.0
Next, install all required dependencies. We recommend creating and activating a Conda environment:
conda env create -f environment.yml
conda activate nerpa-env
Finally, download the PARAS prediction model and compile the C++ code by running:
bash install.sh
We recommend adding the nerpa directory to PATH. In this case, you can run Nerpa simply as nerpa.py from anywhere; otherwise, you will need to specify the full or relative path to nerpa.py.
All running examples below assume that Nerpa is in PATH.
To test your installation, first, try to get the list of the Nerpa command-line options:
nerpa.py -h
Then, try any example from the Quick start section and ensure the log contains no error messages.
If you have any problems, please do not hesitate to contact us.
Sample test data with three antiSMASH-processed BGCs and three NRP structures in the SMILES format is included in the release tarball.
Alternatively, you can download it from here and unpack it in your current working directory.
To run Nerpa on the test data, execute:
nerpa.py -a test_data/antismash --smiles-tsv test_data/smiles.tsv --col-id ID --output-dir nerpa_results/test_run
For details on the output directory contents and their interpretation, refer to the Output files section.
To see the full list of available options, type
nerpa.py -h
All options are divided into four categories. The most important options in each category are listed below.
The most convenient way to obtain antiSMASH predictions of BGCs in your genomic data is to upload your
FASTA or GenBank file to the antiSMASH web server.
Once the server job is completed, download the results (Download -> Download all results), unpack the archive,
and provide the path to the unpacked directory using the -a option.
Alternatively, you can provide Nerpa with the server job ID (e.g., bacteria-2a9bb79e-e804-42c9-bb62-516cac47eca2)
via the --antismash-job-ids option, and Nerpa will download everything automatically.
For multiple jobs, specify them as a space-separated list of IDs.
You can also use the command-line version of antiSMASH.
Nerpa has been tested with outputs from antiSMASH version 7 (7.0.0 and 7.1.0).
If antiSMASH is installed on your system, you can provide raw genome sequences in FASTA or GenBank format via the --genome option,
and Nerpa will run antiSMASH automatically.
To enable this, antiSMASH should be available in your system’s PATH variable, or the path to its installation directory should be specified via the --antismash-installation-path option.
Note that you can specify an unlimited number of antiSMASH output files in any of the following ways:
- Using the -a option multiple times.
- Specifying a root directory containing many inputs.
- Writing paths to all antiSMASH outputs in a single text file and providing it via the --antismash-paths-file option.
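When the list of antiSMASH outputs is produced by a script, repeating the -a option by hand gets tedious. The sketch below assembles such a command line; the directory names are hypothetical, and only options described in this manual are used.

```python
import shlex

def build_antismash_args(antismash_dirs):
    """Repeat the -a option once per antiSMASH output path."""
    args = []
    for path in antismash_dirs:
        args += ["-a", str(path)]
    return args

# Hypothetical directory names, for illustration only.
dirs = ["antismash_out/genome_1", "antismash_out/genome_2"]
command = ["nerpa.py", *build_antismash_args(dirs),
           "--smiles-tsv", "test_data/smiles.tsv"]
print(shlex.join(command))
```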
NRP molecules should be specified in the SMILES format.
You can provide them in one of the following ways:
- As a space-separated list of SMILES strings using the --smiles option.
- In a multi-column file specified via the --smiles-tsv option.
In the latter case, the default column separator (\t), the name of the SMILES column (SMILES), and the column with molecule IDs
(row index) can be adjusted using the --sep, --col-smiles, and --col-id options, respectively.
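If your structures live in another format or database, a small script can produce a file suitable for --smiles-tsv. The sketch below writes a tab-separated file with an ID column and a SMILES column. The file name my_smiles.tsv and the molecule IDs are hypothetical, and the two molecules (L- and D-alanine) are tiny stand-ins rather than real NRPs.

```python
import csv

def write_smiles_tsv(path, molecules, col_id="ID", col_smiles="SMILES"):
    """Write (id, smiles) pairs to a tab-separated file with a header row."""
    with open(path, "w", newline="") as handle:
        writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
        writer.writerow([col_id, col_smiles])
        writer.writerows(molecules)

# Stand-in molecules: isomeric SMILES keep the L-/D- configuration.
molecules = [
    ("toy_L_Ala", "C[C@@H](C(=O)O)N"),
    ("toy_D_Ala", "C[C@H](C(=O)O)N"),
]
write_smiles_tsv("my_smiles.tsv", molecules)
```

Such a file could then be passed as --smiles-tsv my_smiles.tsv --col-id ID; the SMILES column already uses the default name, so --col-smiles is not needed.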
The Nerpa release package comes with a set of NRP databases in the SMILES format:
- Compounds from MIBiG 4.0 and Norine, available in data/input/mibig_norine.tsv.
- Our own database of putative NRP structures, pNRPdb, available in data/input/pnrpdb2.tsv.
You can reuse preprocessed BGCs and/or chemical structures from a previous Nerpa run.
This can save much time if, for example, you want to make several Nerpa runs with the same NRP database.
The preprocessed files are stored in the Nerpa output directory in:
- preprocessed_input/BGC_variants.yaml (for BGCs).
- preprocessed_input/parsed_rban_records.yaml (for NRPs).
To reuse them, provide the corresponding paths via the --bgc-variants and --parsed-rban-records options.
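As an illustration, the helper below derives both options from the output directory of an earlier run. It is a sketch that relies only on the file locations stated above; nerpa_results/rerun is a hypothetical name for the new output directory.

```python
from pathlib import Path

def reuse_args(previous_output_dir):
    """Build the options pointing Nerpa at a previous run's preprocessed files."""
    pre = Path(previous_output_dir) / "preprocessed_input"
    return [
        "--bgc-variants", str(pre / "BGC_variants.yaml"),
        "--parsed-rban-records", str(pre / "parsed_rban_records.yaml"),
    ]

# nerpa_results/test_run is the output directory of the Quick start example.
args = reuse_args("nerpa_results/test_run")
print(" ".join(["nerpa.py", *args, "-o", "nerpa_results/rerun"]))
```

To reuse only one of the two files, keep the corresponding pair of arguments and supply the other input in the usual way.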
- --output-dir <DIR>, -o <DIR>
  Path to the output directory.
  If the directory already exists, Nerpa will exit with an error unless --force-output-dir is specified.
  If not set, Nerpa will create the directory nerpa_results/{CURRENT_TIME} and symlink it to nerpa_results/latest.
- --process-hybrids
  Process NRP-polyketide hybrid monomers (requires rBAN to be used). Recommended.
- --threads
  Number of threads for running Nerpa. Default: 1.
- --skip-molecule-drawing
  Disable drawing of NRP compounds (they will not appear in the HTML report). Enabling this option speeds up the run and reduces output size.
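To show how these options combine, here is a sketch of a small wrapper that assembles a Nerpa command line. The function and its parameter names are hypothetical; the flags themselves are the ones listed in this manual, and the paths come from the Quick start example.

```python
import shlex

def nerpa_command(antismash_dir, smiles_tsv, output_dir, threads=1,
                  process_hybrids=True, skip_drawing=False, force=False):
    """Assemble a Nerpa command line from the options described above."""
    cmd = ["nerpa.py", "-a", str(antismash_dir),
           "--smiles-tsv", str(smiles_tsv),
           "--output-dir", str(output_dir),
           "--threads", str(threads)]
    if process_hybrids:
        cmd.append("--process-hybrids")
    if skip_drawing:
        cmd.append("--skip-molecule-drawing")
    if force:
        # Without this flag Nerpa exits if the output directory already exists.
        cmd.append("--force-output-dir")
    return cmd

cmd = nerpa_command("test_data/antismash", "test_data/smiles.tsv",
                    "nerpa_results/test_run", threads=4)
print(shlex.join(cmd))
# To actually launch the run: subprocess.run(cmd, check=True)
```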
The key files and directories inside the Nerpa output directory (--output-dir) are:
- report.html
  An interactive HTML report showing the best Nerpa matches, along with the corresponding annotated BGCs and NRPs.
- report.tsv
  A tab-separated file containing matched NRP-BGC pairs with their corresponding scores.
- preprocessed_input/
  Directory containing preprocessed data. BGC_variants.yaml and parsed_rban_records.yaml can be reused for another run via the --bgc-variants and --parsed-rban-records options.
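For downstream analysis, report.tsv can be loaded with a few lines of Python. The sketch below assumes the file starts with a header row; the column names are not listed in this manual, so inspect them in your own report before relying on any particular name.

```python
import csv

def load_report(path):
    """Read a tab-separated report into a list of dicts keyed by its header."""
    with open(path, newline="") as handle:
        return list(csv.DictReader(handle, delimiter="\t"))

# Example usage (path follows the default output layout described above):
# rows = load_report("nerpa_results/latest/report.tsv")
# print(list(rows[0].keys()))   # check the actual column names first
```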
If you use Nerpa in your research, please cite our papers:
Nerpa v.2 is described in Olkhovskii et al., bioRxiv 2024.
Nerpa v.1 is published in Kunyavskaya, Tagirdzhanov et al., Metabolites 2021.
You can leave your comments and bug reports at https://github.com/gurevichlab/nerpa/issues (recommended) or send them by e-mail to alexey.gurevich@helmholtz-hips.de.
Your comments, bug reports, and suggestions are very welcome. They will help us improve Nerpa further. In particular, we would love to hear your thoughts on desired features of the future Nerpa web service.
If you have any trouble running Nerpa, please attach nerpa.log and warnings.log from the output directory.

