Complete installation instructions for poster2json.
- Prerequisites
- Standard Installation (Linux/macOS)
- Windows Installation
- Installing pdfalto
- Verifying Installation
- Troubleshooting
## Prerequisites

- GPU: NVIDIA CUDA-capable GPU with ≥16GB VRAM
  - ≥24GB recommended for running both models simultaneously
- RAM: ≥32GB system memory
- Storage: ~50GB for models and dependencies
- Python 3.10+
- CUDA 11.8+ with compatible NVIDIA drivers
- Git
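Before installing, you can confirm the basic software prerequisites with a small stdlib-only sketch (the function names here are illustrative helpers, not part of poster2json):

```python
import shutil
import sys

def meets_min_python(major: int, minor: int) -> bool:
    """Return True if the running interpreter meets the minimum version."""
    return sys.version_info >= (major, minor)

def tool_available(name: str) -> bool:
    """Return True if a command-line tool is found on PATH."""
    return shutil.which(name) is not None

if __name__ == "__main__":
    print(f"Python 3.10+:       {meets_min_python(3, 10)}")
    print(f"git on PATH:        {tool_available('git')}")
    print(f"nvidia-smi on PATH: {tool_available('nvidia-smi')}")
```

`nvidia-smi` being on PATH only shows the driver is installed; CUDA availability from Python is verified separately below.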
## Standard Installation (Linux/macOS)

### Install from GitHub

```bash
pip install git+https://github.com/fairdataihub/posters-science-extraction-api.git
```

This installs poster2json and all dependencies. You can then run:

```bash
poster2json --annotation-dir ./posters --output-dir ./output
```

### Install from source (editable)

```bash
git clone https://github.com/fairdataihub/posters-science-extraction-api.git
cd posters-science-extraction-api
pip install -e .  # Editable install
```

### Install from source (virtual environment)

```bash
git clone https://github.com/fairdataihub/posters-science-extraction-api.git
cd posters-science-extraction-api
python -m venv venv
source venv/bin/activate
pip install -r requirements.txt
```

See Installing pdfalto below.
Verify the core dependencies:

```bash
python -c "import torch; print(f'CUDA available: {torch.cuda.is_available()}')"
python -c "from transformers import AutoTokenizer; print('Transformers OK')"
```

## Windows Installation

Windows users have two options:
### Option 1: Docker

Docker provides the simplest cross-platform experience. See DOCKER.md for complete instructions.

```bash
docker compose up --build
```

### Option 2: WSL2

1. Install WSL2 with Ubuntu:

   ```powershell
   wsl --install -d Ubuntu-22.04
   ```

2. Install NVIDIA CUDA support for WSL2:
   - Download from NVIDIA CUDA WSL

3. Follow the Linux installation steps above inside WSL2.
## Installing pdfalto

pdfalto is required for PDF text extraction with layout preservation.

### Build with Docker

This is the easiest cross-platform method. It produces a Linux binary for Docker/WSL2 use.
```bash
# Clone pdfalto
git clone --recurse-submodules https://github.com/kermitt2/pdfalto.git
cd pdfalto

# Create build Dockerfile
cat > Dockerfile.build << 'EOF'
FROM ubuntu:22.04
RUN apt-get update && apt-get install -y build-essential cmake git && rm -rf /var/lib/apt/lists/*
WORKDIR /pdfalto
COPY . .
RUN cmake . && make -j$(nproc)
EOF

# Build and extract binary
docker build -f Dockerfile.build -t pdfalto-builder .
container=$(docker create pdfalto-builder)
docker cp "${container}":/pdfalto/pdfalto ./pdfalto
docker rm "${container}"

# Move to posters-science-extraction-api
mv ./pdfalto /path/to/posters-science-extraction-api/executables/pdfalto
chmod +x /path/to/posters-science-extraction-api/executables/pdfalto
```

### Build natively

Requires cmake and a C++ compiler (gcc/clang).
```bash
git clone --recurse-submodules https://github.com/kermitt2/pdfalto.git
cd pdfalto
cmake .
make -j$(nproc)
# Binary at: ./pdfalto
```

### Pre-built binaries

Check pdfalto releases for pre-built binaries.
### Binary location

The pipeline searches these locations automatically:

1. `PDFALTO_PATH` environment variable (recommended)
2. `./executables/pdfalto` (in repository)
3. System PATH (`which pdfalto`)
4. `/usr/local/bin/pdfalto`
5. `~/Downloads/pdfalto`
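The search order above can be sketched as a small resolver. This is an illustrative sketch with injectable checks for testing, not poster2json's actual implementation:

```python
import os
import shutil
from pathlib import Path

def find_pdfalto(env=None, is_file=None, which=None):
    """Return the first pdfalto binary found, following the documented order."""
    env = os.environ if env is None else env
    is_file = (lambda p: Path(p).is_file()) if is_file is None else is_file
    which = shutil.which if which is None else which

    # 1. PDFALTO_PATH environment variable (recommended)
    if env.get("PDFALTO_PATH"):
        return env["PDFALTO_PATH"]
    # 2. Repository-local binary
    if is_file("./executables/pdfalto"):
        return "./executables/pdfalto"
    # 3. System PATH
    found = which("pdfalto")
    if found:
        return found
    # 4.-5. Common fallback locations
    for candidate in ("/usr/local/bin/pdfalto",
                      str(Path.home() / "Downloads" / "pdfalto")):
        if is_file(candidate):
            return candidate
    return None
```

Because the environment variable is checked first, setting `PDFALTO_PATH` always wins over any binary found elsewhere.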
Set the environment variable:

```bash
export PDFALTO_PATH="/path/to/pdfalto"
```

Or add it to your shell profile (~/.bashrc, ~/.zshrc):

```bash
echo 'export PDFALTO_PATH="/path/to/pdfalto"' >> ~/.bashrc
source ~/.bashrc
```

## Verifying Installation

Run the test suite on the included example posters:
```bash
python poster_extraction.py \
  --annotation-dir ./example_posters \
  --output-dir ./test_output
```

Expected output:

- JSON files in `./test_output/`
- Console shows extraction progress and metrics
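To sanity-check the run, you can confirm the output directory contains parseable JSON. This is an optional helper sketch, not part of the test suite:

```python
import json
import tempfile
from pathlib import Path

def list_valid_json(output_dir):
    """Return paths of files in output_dir that parse as JSON."""
    valid = []
    for path in sorted(Path(output_dir).glob("*.json")):
        try:
            json.loads(path.read_text(encoding="utf-8"))
            valid.append(path)
        except (json.JSONDecodeError, OSError):
            pass
    return valid

if __name__ == "__main__":
    # Self-contained demo against a temporary directory
    demo = Path(tempfile.mkdtemp())
    (demo / "example.json").write_text('{"title": "demo"}', encoding="utf-8")
    print([p.name for p in list_valid_json(demo)])  # prints ['example.json']
```

An empty result after a run usually means extraction failed upstream; check the console output for errors.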
## Troubleshooting

### CUDA not available

```python
>>> import torch
>>> torch.cuda.is_available()
False
```

Solutions:

- Verify NVIDIA drivers: `nvidia-smi`
- Reinstall PyTorch with CUDA: `pip install torch --index-url https://download.pytorch.org/whl/cu118`
### pdfalto not found

```
WARNING: pdfalto not found, falling back to PyMuPDF
```

Solutions:

- Set the `PDFALTO_PATH` environment variable
- Place the binary at `./executables/pdfalto`
- Verify the binary is executable: `chmod +x pdfalto`
### CUDA out of memory

```
torch.cuda.OutOfMemoryError: CUDA out of memory
```

Solutions:

- Close other GPU applications
- Use 8-bit quantization (automatic for <16GB GPUs)
- Process PDFs and images separately
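The automatic quantization rule above amounts to a simple VRAM threshold. The sketch below is illustrative only; the function name and precision labels are assumptions, not poster2json's API:

```python
def choose_precision(vram_gb: float, threshold_gb: float = 16.0) -> str:
    """Pick a model load precision from available VRAM.

    Mirrors the documented rule: 8-bit quantization for GPUs with
    less than 16GB VRAM, half precision otherwise.
    """
    return "int8" if vram_gb < threshold_gb else "fp16"
```

For example, a 12GB card would load models in 8-bit, while a 24GB card loads them at full half precision.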
### Cannot reach Hugging Face

```
OSError: We couldn't connect to huggingface.co
```

Solutions:

- Check your internet connection
- Use offline mode with pre-downloaded models
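If the models are already in the local Hugging Face cache, offline mode can be enabled with the standard Hugging Face environment variables, set before `transformers` is imported:

```python
import os

# Standard Hugging Face offline switches; models must already be
# present in the local cache (~/.cache/huggingface by default).
os.environ["HF_HUB_OFFLINE"] = "1"
os.environ["TRANSFORMERS_OFFLINE"] = "1"
```

The same variables can also be exported in the shell before launching the pipeline.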
| Variable | Description | Default |
|---|---|---|
| `PDFALTO_PATH` | Path to pdfalto binary | Auto-detected |
| `CUDA_VISIBLE_DEVICES` | GPU device(s) to use | All available |
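Both variables can be set from Python as long as this happens before `torch` is imported (the paths and device index below are illustrative):

```python
import os

# Restrict the pipeline to the first GPU; must precede `import torch`.
os.environ["CUDA_VISIBLE_DEVICES"] = "0"
# Point the pipeline at a specific pdfalto binary.
os.environ["PDFALTO_PATH"] = "/usr/local/bin/pdfalto"
```

Setting `CUDA_VISIBLE_DEVICES` after PyTorch has initialized CUDA has no effect, so prefer exporting it in the shell when in doubt.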
- Docker Setup - Container deployment
- API Reference - REST API usage
- Architecture - Technical details