
Installation Guide

Complete installation instructions for poster2json.


Prerequisites

Hardware Requirements

  • GPU: NVIDIA CUDA-capable GPU with ≥16GB VRAM
    • ≥24GB recommended for running both models simultaneously
  • RAM: ≥32GB system memory
  • Storage: ~50GB for models and dependencies

Software Requirements

  • Python 3.10+
  • CUDA 11.8+ with compatible NVIDIA drivers
  • Git
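A quick preflight check can confirm these prerequisites before installing. The sketch below is illustrative only (`preflight` and `parse_vram_mib` are hypothetical helpers, not part of poster2json); the thresholds mirror the requirements above, and the parser assumes the CSV output of `nvidia-smi --query-gpu=memory.total --format=csv,noheader,nounits`:

```python
import subprocess
import sys

def parse_vram_mib(nvidia_smi_output: str) -> int:
    """Parse total VRAM (MiB) from nvidia-smi CSV output, summed across GPUs."""
    lines = [l.strip() for l in nvidia_smi_output.splitlines() if l.strip()]
    return sum(int(l) for l in lines)

def preflight() -> None:
    """Check the prerequisites listed above; raises AssertionError on failure."""
    # Python 3.10+ required
    assert sys.version_info >= (3, 10), "Python 3.10+ required"
    # Query total GPU memory via nvidia-smi
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=memory.total",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    vram = parse_vram_mib(out)
    assert vram >= 16 * 1024, f"Need >=16GB VRAM, found {vram} MiB"
```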

Standard Installation (Linux/macOS)

Option A: pip install from GitHub (Recommended)

pip install git+https://github.com/fairdataihub/posters-science-extraction-api.git

This installs poster2json and all dependencies. You can then run:

poster2json --annotation-dir ./posters --output-dir ./output

Option B: Clone and Install (Development)

git clone https://github.com/fairdataihub/posters-science-extraction-api.git
cd posters-science-extraction-api
pip install -e .  # Editable install

Option C: Requirements Only

git clone https://github.com/fairdataihub/posters-science-extraction-api.git
cd posters-science-extraction-api
python -m venv venv
source venv/bin/activate
pip install -r requirements.txt

Install pdfalto

See Installing pdfalto below.

Verify Installation

python -c "import torch; print(f'CUDA available: {torch.cuda.is_available()}')"
python -c "from transformers import AutoTokenizer; print('Transformers OK')"

Windows Installation

Windows users have two options:

Option A: Docker (Recommended)

Docker provides the simplest cross-platform experience. See DOCKER.md for complete instructions.

docker compose up --build

Option B: WSL2

  1. Install WSL2 with Ubuntu:

    wsl --install -d Ubuntu-22.04
  2. Install NVIDIA CUDA support for WSL2 by following NVIDIA's CUDA on WSL documentation.

  3. Follow the Linux installation steps above inside WSL2.

Installing pdfalto

pdfalto is required for PDF text extraction with layout preservation.

Option A: Build with Docker (All Platforms)

The easiest cross-platform method. Produces a Linux binary for Docker/WSL2 use.

# Clone pdfalto
git clone --recurse-submodules https://github.com/kermitt2/pdfalto.git
cd pdfalto

# Create build Dockerfile
cat > Dockerfile.build << 'EOF'
FROM ubuntu:22.04
RUN apt-get update && apt-get install -y build-essential cmake git && rm -rf /var/lib/apt/lists/*
WORKDIR /pdfalto
COPY . .
RUN cmake . && make -j$(nproc)
EOF

# Build and extract binary
docker build -f Dockerfile.build -t pdfalto-builder .
container=$(docker create pdfalto-builder)
docker cp "${container}":/pdfalto/pdfalto ./pdfalto
docker rm "${container}"

# Move to posters-science-extraction-api
mv ./pdfalto /path/to/posters-science-extraction-api/executables/pdfalto
chmod +x /path/to/posters-science-extraction-api/executables/pdfalto

Option B: Build from Source (Linux/macOS)

Requires cmake and a C++ compiler (gcc/clang).

git clone --recurse-submodules https://github.com/kermitt2/pdfalto.git
cd pdfalto
cmake .
make -j$(nproc)
# Binary at: ./pdfalto

Option C: Pre-built Binary

Check pdfalto releases for pre-built binaries.

Configure pdfalto Path

The pipeline searches these locations automatically:

  1. PDFALTO_PATH environment variable (recommended)
  2. ./executables/pdfalto (in repository)
  3. System PATH (which pdfalto)
  4. /usr/local/bin/pdfalto
  5. ~/Downloads/pdfalto

Set the environment variable:

export PDFALTO_PATH="/path/to/pdfalto"

Or add to your shell profile (~/.bashrc, ~/.zshrc):

echo 'export PDFALTO_PATH="/path/to/pdfalto"' >> ~/.bashrc
source ~/.bashrc

Verifying Installation

Run the extraction pipeline on the included example posters:

python poster_extraction.py \
    --annotation-dir ./example_posters \
    --output-dir ./test_output

Expected output:

  • JSON files in ./test_output/
  • Console shows extraction progress and metrics

Troubleshooting

CUDA Not Available

>>> import torch
>>> torch.cuda.is_available()
False

Solutions:

  • Verify NVIDIA drivers: nvidia-smi
  • Reinstall PyTorch with CUDA: pip install torch --index-url https://download.pytorch.org/whl/cu118

pdfalto Not Found

WARNING: pdfalto not found, falling back to PyMuPDF

Solutions:

  • Set PDFALTO_PATH environment variable
  • Place binary in ./executables/pdfalto
  • Verify binary is executable: chmod +x pdfalto

Out of Memory

torch.cuda.OutOfMemoryError: CUDA out of memory

Solutions:

  • Close other GPU applications
  • Use 8-bit quantization (automatic for <16GB GPUs)
  • Process PDFs and images separately
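The 8-bit fallback mentioned above can be illustrated with a small decision helper (a sketch of the stated policy, not the pipeline's actual code; the 16GB threshold comes from the note above, and the precision labels are placeholders):

```python
def choose_precision(vram_gb: float) -> str:
    """Pick a model-loading precision from available VRAM:
    GPUs under 16GB fall back to 8-bit quantization."""
    return "int8" if vram_gb < 16 else "fp16"
```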

Model Download Issues

OSError: We couldn't connect to huggingface.co

Solutions:

  • Check internet connection
  • Use offline mode with pre-downloaded models

Environment Variables

  • PDFALTO_PATH: Path to the pdfalto binary. Default: auto-detected (see search order above).
  • CUDA_VISIBLE_DEVICES: GPU device(s) to use. Default: all available.

Next Steps