Skip to content

Latest commit

 

History

History
225 lines (162 loc) · 9.01 KB

File metadata and controls

225 lines (162 loc) · 9.01 KB

Pyserini: Detailed Installation Guide

Pyserini is built on Python 3.12 (other versions might work, but YMMV). See pyproject.toml for a detailed list of dependencies. At a high level:

  • Pyserini depends on Anserini, which is built on Lucene. PyJNIus is used to interact with the JVM. We depend on Java 21.
  • We need PyTorch, 🤗 Transformers, and the ONNX Runtime for "neural stuff".
  • We need Faiss for searching dense vectors. (Although we can search dense vectors with Lucene also.)

A pip installation should automatically pull in major dependencies without any major issues:

pip install pyserini

❗ For PyTorch, 🤗 Transformers, and the ONNX Runtime: sometimes pip has issues pulling in the "right" versions. If this is the case, it might make sense to install these dependencies first (selecting the right version by hand, perhaps via conda).

Faiss is not included in the dependencies list because there is a proliferation of variants (faiss-cpu, faiss-gpu, etc.), so it's easier if you install the right variant yourself.

Multimodal support (e.g., image search) has been pushed into an optional package. If you need it, install 'pyserini[optional]':

pip install 'pyserini[optional]'

tl;dr — 🤞

PyPI Installation Walkthrough

Below is a step-by-step Pyserini installation guide based on Python 3.12. We recommend using Anaconda and assume you have already installed it. The following instructions are up to date as of April 2026 and should work.

Mac

If you're on a Mac with an M-series (i.e., ARM) processor:

conda create -n pyserini python=3.12 -y
conda activate pyserini

# Inside the new environment...
conda install -c anaconda wget -y
conda install -c conda-forge openjdk=21 maven -y

# from https://pytorch.org/get-started/locally/
pip install torch torchvision torchaudio

# If you want the optional dependencies, otherwise skip
conda install -c pytorch faiss-cpu -y

# Good idea to always explicitly specify the latest version, found here: https://pypi.org/project/pyserini/
pip install pyserini==latest
# If you want the optional dependencies, otherwise skip; the temperamental packages are already installed at this point
# so should be smooth...
pip install 'pyserini==latest'

Linux

Follow the recipe below:

conda create -n pyserini python=3.12 -y
conda activate pyserini

# Inside the new environment...
conda install -c conda-forge openjdk=21 maven -y

# from https://pytorch.org/get-started/locally/
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cpu

# If you have a CUDA-enabled GPU and want to use it (e.g., with --device cuda), install the CUDA version of PyTorch instead
# Find the right command for your CUDA version at https://pytorch.org/get-started/locally/, for example, for CUDA 12.6:
# pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu126

# If you want the optional dependencies, otherwise skip
conda install -c pytorch faiss-cpu -y

# Good idea to always explicitly specify the latest version, found here: https://pypi.org/project/pyserini/
pip install pyserini==latest
# If you want the optional dependencies, otherwise skip; the temperamental packages are already installed at this point
# so should be smooth...
pip install 'pyserini==latest'

Verifying the Installation

By this point, Pyserini should have been installed. It might be worthwhile to do a bit of sanity checking, per below.

To confirm that bag-of-words retrieval is working correctly, you can run the BM25 baseline on the MS MARCO passage ranking task:

python -m pyserini.search.lucene \
  --threads 16 --batch-size 128 \
  --index msmarco-v1-passage \
  --topics msmarco-passage-dev-subset \
  --output run.msmarco-v1-passage.bm25-default.dev.txt \
  --bm25 --output-format msmarco

python -m pyserini.eval.msmarco_passage_eval \
  msmarco-passage-dev-subset run.msmarco-v1-passage.bm25-default.dev.txt

Expected Results:

MRR @10: 0.18741227770955546

To confirm that dense retrieval is working correctly with Lucene using an HNSW index:

python -m pyserini.search.lucene \
  --threads 16 --batch-size 128 --dense --hnsw \
  --index msmarco-v1-passage.bge-base-en-v1.5.hnsw \
  --topics dl19-passage \
  --onnx-encoder BgeBaseEn15 \
  --output run.msmarco-v1-passage.bge-base-en-v1.5.lucene-hnsw.onnx.dl19.txt \
  --hits 1000 --ef-search 1000

python -m pyserini.eval.trec_eval -c -l 2 -m map dl19-passage \
  run.msmarco-v1-passage.bge-base-en-v1.5.lucene-hnsw.onnx.dl19.txt
python -m pyserini.eval.trec_eval -c -m ndcg_cut.10 dl19-passage \
  run.msmarco-v1-passage.bge-base-en-v1.5.lucene-hnsw.onnx.dl19.txt
python -m pyserini.eval.trec_eval -c -l 2 -m recall.1000 dl19-passage \
  run.msmarco-v1-passage.bge-base-en-v1.5.lucene-hnsw.onnx.dl19.txt

Expected Results:

map                       all    0.4486
ndcg_cut_10               all    0.7016
recall_1000               all    0.8441

To confirm that dense retrieval is working correctly with Faiss, run our TCT-ColBERT (v2) model on the MS MARCO passage ranking task:

python -m pyserini.search.faiss \
  --topics msmarco-passage-dev-subset \
  --index msmarco-v1-passage.tct_colbert-v2-hnp \
  --encoded-queries tct_colbert-v2-hnp-msmarco-passage-dev-subset \
  --threads 12 --batch-size 384 \
  --output run.msmarco-passage.tct_colbert-v2.bf.tsv \
  --output-format msmarco

python -m pyserini.eval.msmarco_passage_eval \
  msmarco-passage-dev-subset run.msmarco-passage.tct_colbert-v2.bf.tsv

Expected Results:

#####################
MRR @10: 0.3584
QueriesRanked: 6980
#####################

If you've gotten to here, then everything should be working properly.

Development Installation

If you're planning on just using Pyserini, then the instructions above are fine. However, if you're planning on contributing to the codebase or want to work with the latest not-yet-released features, you'll need a development installation.

Start the same way as the install above, but don't install pip install pyserini.

Instead, clone the Pyserini repo with the --recurse-submodules option to make sure the tools/ submodule also gets cloned:

git clone git@github.com:castorini/pyserini.git --recurse-submodules

The tools/ directory, which contains evaluation tools and scripts, is actually this repo, integrated as a Git submodule (so that it can be shared across related projects). Change into the pyserini subdirectory and build as follows (you might get warnings, but okay to ignore):

cd tools/eval && tar xvfz trec_eval.9.0.4.tar.gz && cd trec_eval.9.0.4 && make && cd ../../..
cd tools/eval/ndeval && make && cd ../../..

Then, in the pyserini clone, use pip to add an "editable" installation, as follows:

pip install -e .

Next, you'll need to clone and build Anserini. It makes sense to put both pyserini/ and anserini/ in a common folder. After you've successfully built Anserini, copy the fatjar, which will be target/anserini-X.Y.Z-SNAPSHOT-fatjar.jar into pyserini/resources/jars/. As with the pip installation, a potential source of frustration is incompatibility among different versions of underlying dependencies.

You can confirm everything is working by running the unit tests:

python -m unittest discover -s tests

Assuming all tests pass, you should be ready to go!

Troubleshooting Tips

The above guide handles JVM installation via Conda. If you are using your own Java environment and get an error about Java version mismatch, it's likely an issue with your JAVA_HOME environmental variable. In bash, use echo $JAVA_HOME to find out what the environmental variable is currently set to, and use export JAVA_HOME=/path/to/java/home to change it to the correct path. On a Linux system, the correct path might look something like /usr/lib/jvm/java-21. Unfortunately, we are unable to offer more concrete advice since the actual path depends on your OS, which JDK you're using, and a host of other factors.

Internal Notes

At the University of Waterloo, we have two (CPU) development servers, tuna and orca. Note that on these two servers, the root disk (where your home directory is mounted) doesn't have much space. So, you need to set pyserini cache path to scratch space.

  • For tuna, create the dir /tuna1/scratch/{username}
  • For orca, create the dir /store/scratch/{username}

Set the PYSERINI_CACHE environment variable to point to the directory you created above.