
Installation Guide

Complete installation instructions for poster2json.


Prerequisites

Hardware Requirements

  • GPU: NVIDIA CUDA-capable GPU with ≥16GB VRAM
    • ≥24GB recommended for running both models simultaneously
  • RAM: ≥32GB system memory
  • Storage: ~50GB for models and dependencies

Software Requirements

  • Python 3.10+
  • CUDA 11.8+ with compatible NVIDIA drivers
  • Git
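A quick preflight check can confirm these prerequisites before installing. The sketch below is illustrative only (`preflight` and `parse_vram_mib` are hypothetical helpers, not part of poster2json); the thresholds mirror the requirements above, and the parser assumes the CSV output of `nvidia-smi --query-gpu=memory.total --format=csv,noheader,nounits`:

```python
import subprocess
import sys

def parse_vram_mib(nvidia_smi_output: str) -> int:
    """Parse total VRAM (MiB) from nvidia-smi CSV output, summed across GPUs."""
    lines = [l.strip() for l in nvidia_smi_output.splitlines() if l.strip()]
    return sum(int(l) for l in lines)

def preflight() -> None:
    """Check the prerequisites listed above; raises AssertionError on failure."""
    # Python 3.10+ required
    assert sys.version_info >= (3, 10), "Python 3.10+ required"
    # Query total GPU memory via nvidia-smi
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=memory.total",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    vram = parse_vram_mib(out)
    assert vram >= 16 * 1024, f"Need >=16GB VRAM, found {vram} MiB"
```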

Standard Installation (Linux/macOS)

Option A: pip install from GitHub (Recommended)

pip install git+https://github.com/fairdataihub/posters-science-extraction-api.git

This installs poster2json and all dependencies. You can then run:

poster2json --annotation-dir ./posters --output-dir ./output

Option B: Clone and Install (Development)

git clone https://github.com/fairdataihub/posters-science-extraction-api.git
cd posters-science-extraction-api
pip install -e .  # Editable install

Option C: Requirements Only

git clone https://github.com/fairdataihub/posters-science-extraction-api.git
cd posters-science-extraction-api
python -m venv venv
source venv/bin/activate
pip install -r requirements.txt

Install pdfalto

See Installing pdfalto below.

Verify Installation

python -c "import torch; print(f'CUDA available: {torch.cuda.is_available()}')"
python -c "from transformers import AutoTokenizer; print('Transformers OK')"

Windows Installation

Windows users have two options:

Option A: Docker (Recommended)

Docker provides the simplest cross-platform experience. See DOCKER.md for complete instructions.

docker compose up --build

Option B: WSL2

  1. Install WSL2 with Ubuntu:

    wsl --install -d Ubuntu-22.04
  2. Install NVIDIA CUDA support for WSL2 by following NVIDIA's CUDA on WSL documentation.

  3. Follow the Linux installation steps above inside WSL2.

Installing pdfalto

pdfalto is required for PDF text extraction with layout preservation.

Option A: Build with Docker (All Platforms)

The easiest cross-platform method. Produces a Linux binary for Docker/WSL2 use.

# Clone pdfalto
git clone --recurse-submodules https://github.com/kermitt2/pdfalto.git
cd pdfalto

# Create build Dockerfile
cat > Dockerfile.build << 'EOF'
FROM ubuntu:22.04
RUN apt-get update && apt-get install -y build-essential cmake git && rm -rf /var/lib/apt/lists/*
WORKDIR /pdfalto
COPY . .
RUN cmake . && make -j$(nproc)
EOF

# Build and extract binary
docker build -f Dockerfile.build -t pdfalto-builder .
container=$(docker create pdfalto-builder)
docker cp "${container}":/pdfalto/pdfalto ./pdfalto
docker rm "${container}"

# Move to posters-science-extraction-api
mv ./pdfalto /path/to/posters-science-extraction-api/executables/pdfalto
chmod +x /path/to/posters-science-extraction-api/executables/pdfalto

Option B: Build from Source (Linux/macOS)

Requires cmake and a C++ compiler (gcc/clang).

git clone --recurse-submodules https://github.com/kermitt2/pdfalto.git
cd pdfalto
cmake .
make -j$(nproc)
# Binary at: ./pdfalto

Option C: Pre-built Binary

Check pdfalto releases for pre-built binaries.

Configure pdfalto Path

The pipeline searches these locations automatically:

  1. PDFALTO_PATH environment variable (recommended)
  2. ./executables/pdfalto (in repository)
  3. System PATH (which pdfalto)
  4. /usr/local/bin/pdfalto
  5. ~/Downloads/pdfalto

Set the environment variable:

export PDFALTO_PATH="/path/to/pdfalto"

Or add to your shell profile (~/.bashrc, ~/.zshrc):

echo 'export PDFALTO_PATH="/path/to/pdfalto"' >> ~/.bashrc
source ~/.bashrc

Verifying Installation

Run the extraction pipeline on the included example posters:

python poster_extraction.py \
    --annotation-dir ./example_posters \
    --output-dir ./test_output

Expected output:

  • JSON files in ./test_output/
  • Console shows extraction progress and metrics

Troubleshooting

CUDA Not Available

>>> import torch
>>> torch.cuda.is_available()
False

Solutions:

  • Verify NVIDIA drivers: nvidia-smi
  • Reinstall PyTorch with CUDA: pip install torch --index-url https://download.pytorch.org/whl/cu118

pdfalto Not Found

WARNING: pdfalto not found, falling back to PyMuPDF

Solutions:

  • Set PDFALTO_PATH environment variable
  • Place binary in ./executables/pdfalto
  • Verify binary is executable: chmod +x pdfalto

Out of Memory

torch.cuda.OutOfMemoryError: CUDA out of memory

Solutions:

  • Close other GPU applications
  • Use 8-bit quantization (automatic for <16GB GPUs)
  • Process PDFs and images separately
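The 8-bit fallback mentioned above can be illustrated with a small decision helper (a sketch of the stated policy, not the pipeline's actual code; the 16GB threshold comes from the note above, and the precision labels are placeholders):

```python
def choose_precision(vram_gb: float) -> str:
    """Pick a model-loading precision from available VRAM:
    GPUs under 16GB fall back to 8-bit quantization."""
    return "int8" if vram_gb < 16 else "fp16"
```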

Model Download Issues

OSError: We couldn't connect to huggingface.co

Solutions:

  • Check internet connection
  • Use offline mode with pre-downloaded models

Environment Variables

  • PDFALTO_PATH: Path to the pdfalto binary. Default: auto-detected (see search order above).
  • CUDA_VISIBLE_DEVICES: GPU device(s) to use. Default: all available.

Next Steps