Nerpa 2.1 Manual

About Nerpa
1.1. Nerpa pipeline
1.2. Supported data types
Installation
2.1. Prerequisites
2.2. Installation from tarball
2.3. Verifying your installation
Running Nerpa
3.1. Quick start
3.2. Command-line options
3.3. Output files
Citation
Feedback and bug reports

About Nerpa

Nerpa is a tool for linking biosynthetic gene clusters (BGCs) to known nonribosomal peptide (NRP) structures. You can read more about the Nerpa algorithm and the practical applications of the tool in our papers. Nerpa is currently developed and maintained by Gurevich Lab at the Helmholtz Institute for Pharmaceutical Research Saarland (HIPS) and the Center for Bioinformatics Saar (CBI).

This manual will help you to install and run the tool. Nerpa version 2.1.0 was released on 25.03.2026. The tool is dual-licensed and is available under GPLv3 or Creative Commons BY-NC-SA 4.0, see LICENSE.txt.

Nerpa pipeline

The simplified Nerpa pipeline is depicted in the figure below.

Nerpa takes as input an NRP structure database and genome sequences. The pipeline goes as follows:

Construct tentative NRP synthetase assembly lines along with respective sequences of genome-predicted residues (using antiSMASH for BGC annotation and PARAS for A domain specificity prediction).
Construct representations of the database structures as monomer graphs (using rBAN).
Build HMMs for genome-predicted NRP synthetase assembly lines as described in the Nerpa 2 paper.
Extract NRP linearizations from the monomer graphs.
Score the NRP linearizations against the HMMs in an all-vs-all manner (using the Viterbi algorithm).
Create an interactive report with the best matches and detailed alignments.
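
The scoring step in the list above relies on Viterbi decoding. The sketch below is a generic textbook Viterbi in log space, included only to illustrate the idea of finding the best-scoring state path for a sequence of residues. It is not Nerpa's actual HMM: the function name, the two states X and Y, and all probabilities are toy assumptions made up for this example.

```python
import math


def viterbi(observations, states, start_p, trans_p, emit_p):
    """Generic Viterbi decoding in log space.

    Returns (log-probability of the best path, best state path).
    Assumes a non-empty observation list and strictly positive probabilities.
    """
    log = math.log
    # best[s]: log-probability of the best path that ends in state s
    best = {s: log(start_p[s]) + log(emit_p[s][observations[0]]) for s in states}
    backpointers = []
    for obs in observations[1:]:
        prev, best, pointers = best, {}, {}
        for s in states:
            # pick the predecessor that maximizes the path score into s
            p = max(states, key=lambda r: prev[r] + log(trans_p[r][s]))
            best[s] = prev[p] + log(trans_p[p][s]) + log(emit_p[s][obs])
            pointers[s] = p
        backpointers.append(pointers)
    # trace back from the best final state
    last = max(states, key=lambda s: best[s])
    path = [last]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    path.reverse()
    return best[last], path


# Toy model (hypothetical numbers, not taken from Nerpa)
states = ["X", "Y"]
start_p = {"X": 0.8, "Y": 0.2}
trans_p = {"X": {"X": 0.3, "Y": 0.7}, "Y": {"X": 0.4, "Y": 0.6}}
emit_p = {"X": {"Val": 0.9, "Leu": 0.1}, "Y": {"Val": 0.2, "Leu": 0.8}}

log_score, path = viterbi(["Val", "Leu"], states, start_p, trans_p, emit_p)
print(path, round(math.exp(log_score), 4))  # ['X', 'Y'] 0.4032
```

Working in log space avoids numerical underflow on long sequences, which is why the sketch sums logarithms instead of multiplying probabilities.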

Supported data types

For genome sequences:

Recommended: complete antiSMASH output after processing your raw genome sequence (e.g., downloaded from the antiSMASH web server); or antiSMASH job IDs (in this case, Nerpa will download the results automatically).
Also accepted: raw genome sequences in the FASTA and GenBank formats; in this case, Nerpa will predict NRP BGCs in them with antiSMASH (should be installed separately and present in PATH or provided to Nerpa via --antismash-installation-dir).

For NRP structures:

Recommended: isomeric SMILES format; Nerpa distinguishes between L- and D-configurations of amino acids, so the use of the isomeric format leads to more accurate results.
Also accepted: any other SMILES, i.e., without stereochemistry information.

Note: you can use free online converters to get (isomeric) SMILES from other popular chemical formats such as MDL MOL or InChI, e.g., this one from UNM. Alternatively, there are many command-line converters, e.g., molconvert, or programming libraries, e.g., RDKit.

Installation

Prerequisites

(Required) Nerpa relies on Java (to run the embedded rBAN), Python v3.10 or higher, and a number of Python dependencies specified in the environment.yml file.
We highly recommend installing Conda to easily set up all dependencies, as demonstrated below.
(Required) Nerpa's core scoring algorithm is implemented in C++. Please install a C++20 compiler and CMake v3.10 or higher.
(Optional) If you plan to use Nerpa with raw genome sequences (FASTA or GenBank) rather than antiSMASH-processed files, you will also need to install antiSMASH locally.
Alternatively, you can use the antiSMASH web server.
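
Before installing, it can help to check which prerequisites are already visible on your system. The following is a minimal sketch, not part of Nerpa: the helper name check_prerequisites is hypothetical, and it only tests the Python version and whether the java and cmake executables are found in PATH. It does not check the C++20 compiler (its executable name varies between systems) or the versions of Java and CMake.

```python
import shutil
import sys


def check_prerequisites(tools=("java", "cmake")):
    """Map each prerequisite name to True/False depending on whether it was found."""
    # Python itself: the manual asks for v3.10 or higher
    status = {"python>=3.10": sys.version_info >= (3, 10)}
    for tool in tools:
        # shutil.which returns None when the executable is not in PATH
        status[tool] = shutil.which(tool) is not None
    return status


if __name__ == "__main__":
    for name, ok in check_prerequisites().items():
        print(f"{'OK     ' if ok else 'MISSING'} {name}")
```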

Installation from tarball

First, download and unpack the release tarball:

curl -L -o nerpa-2.1.0.tar.gz https://github.com/gurevichlab/nerpa/releases/download/nerpa_2.1.0/nerpa-2.1.0.tar.gz
tar -xzf nerpa-2.1.0.tar.gz
cd nerpa-2.1.0

Next, install all required dependencies. We recommend creating and activating a Conda environment:

conda env create -f environment.yml
conda activate nerpa-env

Finally, download the PARAS prediction model and compile the C++ code by running:

bash install.sh

Verifying your installation

We recommend adding the nerpa directory to PATH. In this case, you can run Nerpa simply as nerpa.py from anywhere; otherwise, you need to specify the full or relative path to nerpa.py. All running examples below assume that Nerpa is in PATH.

To test your installation, first, try to get the list of the Nerpa command-line options:

nerpa.py -h

Then, try any example from the Quick start section and ensure the log contains no error messages.

If you have any problems, please do not hesitate to contact us.

Running Nerpa

Quick start

Sample test data with three antiSMASH-processed BGCs and three NRP structures in the SMILES format is included in the release tarball.
Alternatively, you can download it from here and unpack it in your current working directory.

To run Nerpa on the test data, execute:

nerpa.py -a test_data/antismash --smiles-tsv test_data/smiles.tsv --col-id ID --output-dir nerpa_results/test_run

For details on the output directory contents and their interpretation, refer to the Output Files section.
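
For scripted or batch use, the same invocation can be assembled from Python. The sketch below only rebuilds the test-data command from the options shown in this section; the helper name build_nerpa_command is hypothetical, and the actual launch is left commented out so that the snippet runs even without Nerpa in PATH.

```python
import shlex


def build_nerpa_command(antismash_dir, smiles_tsv, col_id, output_dir):
    """Return the Nerpa command line as an argument list."""
    return [
        "nerpa.py",
        "-a", antismash_dir,
        "--smiles-tsv", smiles_tsv,
        "--col-id", col_id,
        "--output-dir", output_dir,
    ]


cmd = build_nerpa_command(
    "test_data/antismash", "test_data/smiles.tsv", "ID", "nerpa_results/test_run"
)
print(shlex.join(cmd))

# To actually launch Nerpa (requires nerpa.py in PATH):
# import subprocess
# subprocess.run(cmd, check=True)
```

Passing the arguments as a list rather than a single string avoids shell quoting problems when paths contain spaces.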

Command-line options

To see the full list of available options, type

nerpa.py -h

All options are divided into four categories. The most important options in each category are listed below.

Genomic input (genome sequences or BGCs)

The most convenient way to obtain antiSMASH predictions of BGCs in your genomic data is to upload your
FASTA or GenBank file to the antiSMASH web server.
Once the server job is completed, download the results (Download -> Download all results), unpack the archive,
and provide the path to the unpacked directory using the -a option.

Alternatively, you can provide Nerpa with the server job ID (e.g., bacteria-2a9bb79e-e804-42c9-bb62-516cac47eca2)
via the --antismash-job-ids option, and Nerpa will download everything automatically.
For multiple jobs, specify them as a space-separated list of IDs.

You can also use the command-line version of antiSMASH.
Nerpa has been tested with outputs from antiSMASH version 7 (7.0.0 and 7.1.0).
If antiSMASH is installed on your system, you can provide raw genome sequences in FASTA or GenBank format via the --genome option,
and Nerpa will run antiSMASH automatically.
To enable this, antiSMASH should be available in your system’s PATH variable, or the path to its installation directory should be specified via the --antismash-installation-path option.

Note that you can specify an unlimited number of antiSMASH output files by either:

Using the -a option multiple times.
Specifying a root directory containing many inputs.
Writing paths to all antiSMASH outputs in a single text file and providing it via the --antismash-paths-file option.
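
When antiSMASH outputs are scattered across several locations, the paths file can be generated with a few lines of Python. This is a sketch under an assumption: the manual does not state the exact file layout, so one path per line is assumed here. The helper name and the directory names runs/genome_A and runs/genome_B are hypothetical placeholders.

```python
import tempfile
from pathlib import Path


def write_antismash_paths_file(output_dirs, paths_file):
    """Write absolute paths, one per line (assumed layout), and return them."""
    lines = [str(Path(d).resolve()) for d in output_dirs]
    Path(paths_file).write_text("\n".join(lines) + "\n", encoding="utf-8")
    return lines


# Demo in a temporary directory so that nothing is left behind
with tempfile.TemporaryDirectory() as tmp:
    paths_file = Path(tmp) / "antismash_paths.txt"
    written = write_antismash_paths_file(["runs/genome_A", "runs/genome_B"], paths_file)
    file_lines = paths_file.read_text(encoding="utf-8").splitlines()

print(file_lines)
```

The resulting file would then be passed to Nerpa via the --antismash-paths-file option.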

Chemical input (compounds)

NRP molecules should be specified in the SMILES format.
You can provide them in one of the following ways:

As a space-separated list of SMILES strings using the --smiles option.
In a multi-column file specified via the --smiles-tsv option.

In the latter case, the column separator (default: \t), the name of the SMILES column (default: SMILES), and the column with molecule IDs (default: row index) can be adjusted using the --sep, --col-smiles, and --col-id options, respectively.
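
As an illustration of the multi-column input, the sketch below writes a small tab-separated file with an ID column and a SMILES column, matching the default separator and the default SMILES column name described above. The helper name, the IDs, and the two molecules (glycine and L-alanine, used only as simple placeholders, not as real NRPs) are assumptions made for this example.

```python
import csv
import tempfile
from pathlib import Path


def write_smiles_tsv(compounds, tsv_path):
    """Write (ID, SMILES) pairs to a tab-separated file with a header row."""
    with open(tsv_path, "w", newline="", encoding="utf-8") as fh:
        writer = csv.writer(fh, delimiter="\t", lineterminator="\n")
        writer.writerow(["ID", "SMILES"])
        writer.writerows(compounds)


# Placeholder molecules; replace with your own NRP structures
compounds = [
    ("toy_gly", "NCC(=O)O"),
    ("toy_L_ala", "C[C@H](N)C(=O)O"),  # isomeric SMILES keeps the stereocenter
]

with tempfile.TemporaryDirectory() as tmp:
    tsv_path = Path(tmp) / "my_smiles.tsv"
    write_smiles_tsv(compounds, tsv_path)
    with open(tsv_path, newline="", encoding="utf-8") as fh:
        rows = list(csv.DictReader(fh, delimiter="\t"))

print(rows)
```

A file like this would be passed with --smiles-tsv my_smiles.tsv --col-id ID; the --col-smiles and --sep options are only needed when the column name or separator differs from the defaults.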

The Nerpa release package comes with a set of NRP databases in the SMILES format:

Compounds from MIBiG 4.0 and Norine, available in data/input/mibig_norine.tsv.
Our own database of putative NRP structures, pNRPdb, available in data/input/pnrpdb2.tsv.

Advanced Input

You can reuse preprocessed BGCs and/or chemical structures from a previous Nerpa run.
This can save considerable time if, for example, you want to make several Nerpa runs with the same NRP database.

The preprocessed files are stored in the Nerpa output directory in:

preprocessed_input/BGC_variants.yaml (for BGCs).
preprocessed_input/parsed_rban_records.yaml (for NRPs).

To reuse them, provide the corresponding paths via the --bgc-variants and --parsed-rban-records options.
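
In a script, the extra arguments for reusing preprocessed files can be derived from a previous output directory. The sketch below is an illustration only: the helper name reuse_args is hypothetical, and it simply adds each option when the corresponding file exists at the location given above.

```python
import tempfile
from pathlib import Path


def reuse_args(previous_output_dir):
    """Return Nerpa options pointing to preprocessed files that actually exist."""
    pre = Path(previous_output_dir) / "preprocessed_input"
    candidates = [
        ("--bgc-variants", pre / "BGC_variants.yaml"),
        ("--parsed-rban-records", pre / "parsed_rban_records.yaml"),
    ]
    args = []
    for option, path in candidates:
        if path.is_file():
            args += [option, str(path)]
    return args


# Demo: a fake previous run that contains only the preprocessed BGCs
with tempfile.TemporaryDirectory() as tmp:
    pre_dir = Path(tmp) / "preprocessed_input"
    pre_dir.mkdir()
    (pre_dir / "BGC_variants.yaml").write_text("", encoding="utf-8")
    demo_args = reuse_args(tmp)
    expected_bgc_path = str(pre_dir / "BGC_variants.yaml")

print(demo_args)
```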

Pipeline Options

--output-dir <DIR>, -o <DIR>
Path to the output directory.
If the directory already exists, Nerpa will exit with an error unless --force-output-dir is specified.
If not set, Nerpa will create the directory nerpa_results/{CURRENT_TIME} and symlink it to nerpa_results/latest.
--process-hybrids
Process NRP-polyketide hybrid monomers (requires rBAN to be used). Recommended.
--threads
Number of threads for running Nerpa. Default: 1.
--skip-molecule-drawing
Disable drawing of NRP compounds (they will not appear in the HTML report). Enabling this option speeds up the run and reduces output size.

Output Files

The key files and directories inside the Nerpa output directory (--output-dir) are:

report.html
An interactive HTML report showing the best Nerpa matches, along with the corresponding annotated BGCs and NRPs.
report.tsv
A tab-separated file containing matched NRP-BGC pairs with their corresponding scores.
preprocessed_input/
Directory containing preprocessed data. BGC_variants.yaml and parsed_rban_records.yaml
can be reused for another run via the --bgc-variants and --parsed-rban-records options.
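
For downstream analysis, report.tsv can be loaded with Python's standard csv module. The sketch below makes no assumption about the actual column names, which are not listed in this manual: it only returns the header and the rows as dictionaries. The helper name load_report and the columns column_1 and column_2 in the demo file are placeholders invented for the example.

```python
import csv
import tempfile
from pathlib import Path


def load_report(report_path):
    """Read a tab-separated report; return (column names, list of row dicts)."""
    with open(report_path, newline="", encoding="utf-8") as fh:
        reader = csv.DictReader(fh, delimiter="\t")
        rows = list(reader)
        return list(reader.fieldnames or []), rows


# Demo on a stand-in file; real reports have their own columns
with tempfile.TemporaryDirectory() as tmp:
    demo_report = Path(tmp) / "report.tsv"
    demo_report.write_text("column_1\tcolumn_2\nvalue_a\tvalue_b\n", encoding="utf-8")
    columns, rows = load_report(demo_report)

print(columns, len(rows))
```

Inspecting the returned column names on a real report is a reasonable first step before filtering or sorting matches by score.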

Citation

If you use Nerpa in your research, please cite our papers:
Nerpa v.2 is described in Olkhovskii et al., bioRxiv 2024.
Nerpa v.1 is published in Kunyavskaya, Tagirdzhanov et al., Metabolites 2021.

Feedback and bug reports

You can leave your comments and bug reports at https://github.com/gurevichlab/nerpa/issues (recommended way) or send them via e-mail to alexey.gurevich@helmholtz-hips.de.

Your comments, bug reports, and suggestions are very welcome. They will help us to improve Nerpa further. In particular, we would love to hear your thoughts on desired features of the future Nerpa web service.

If you have any trouble running Nerpa, please attach nerpa.log and warnings.log from the output directory.
