Posters Science Extraction API

Convert scientific posters (PDF/images) into structured JSON metadata using Large Language Models.

Quick Start

pip install -r requirements.txt

Basic Usage

# If installed via pip
poster2json --annotation-dir "./posters" --output-dir "./output"

# Or run directly
python poster_extraction.py --annotation-dir "./posters" --output-dir "./output"

Docker (Recommended for Windows)

docker compose up --build

See Docker Setup for detailed instructions, including Windows/WSL2 support.

How It Works

PDF/Image → Raw Text Extraction → LLM JSON Structuring → Structured JSON
                ↓                         ↓
           [pdfalto]              [Llama 3.1 8B]
           [Qwen2-VL]             Fine-tuned for posters

PDF files → Processed via pdfalto for layout-aware text extraction
Image files → Processed via Qwen2-VL-7B vision-language model
All files → Structured into JSON by Llama-3.1-8B-Poster-Extraction
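
The routing above can be sketched in a few lines of Python. This is only an illustration of the dispatch logic, not the code in poster_extraction.py: the function names and the set of accepted image extensions are assumptions, and the stage names are returned as plain strings instead of calling the real models.

```python
from pathlib import Path

# Assumption: the extensions below are illustrative; the real pipeline
# may accept a different set of formats.
PDF_EXTENSIONS = {".pdf"}
IMAGE_EXTENSIONS = {".png", ".jpg", ".jpeg"}


def choose_text_extractor(path: str) -> str:
    """Return the name of the raw-text extraction stage for a poster file."""
    suffix = Path(path).suffix.lower()
    if suffix in PDF_EXTENSIONS:
        return "pdfalto"
    if suffix in IMAGE_EXTENSIONS:
        return "Qwen2-VL-7B"
    raise ValueError(f"Unsupported poster file type: {suffix or '(none)'}")


def plan_pipeline(path: str) -> list[str]:
    """List the stages a file passes through, in order."""
    # Every file ends with the same JSON structuring stage.
    return [choose_text_extractor(path), "Llama-3.1-8B-Poster-Extraction"]


if __name__ == "__main__":
    for name in ("poster.pdf", "poster.PNG"):
        print(name, "->", " -> ".join(plan_pipeline(name)))
```

Keeping the extractor choice separate from the structuring stage mirrors the diagram: only the first stage depends on the input type.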

Output Format

Output conforms to the poster-json-schema:

{
  "$schema": "https://posters.science/schema/v0.1/poster_schema.json",
  "creators": [
    {
      "name": "LastName, FirstName",
      "givenName": "FirstName",
      "familyName": "LastName",
      "affiliation": ["Institution"]
    }
  ],
  "titles": [{ "title": "Poster Title" }],
  "posterContent": {
    "sections": [
      { "sectionTitle": "Abstract", "sectionContent": "..." },
      { "sectionTitle": "Methods", "sectionContent": "..." }
    ]
  },
  "imageCaptions": [{ "captions": ["Figure 1.", "Description"] }],
  "tableCaptions": [{ "captions": ["Table 1.", "Description"] }]
}
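
As a rough illustration of how a consumer might read this output, the sketch below loads a record shaped like the example and pulls out the title, creator names, and section titles. The helper name and the choice of which keys to treat as required are assumptions made for the example; the published schema is the authority on what is mandatory.

```python
import json

# Assumption: these keys are treated as required for this example only.
REQUIRED_KEYS = ("creators", "titles", "posterContent")

EXAMPLE_JSON = """
{
  "creators": [{"name": "LastName, FirstName", "affiliation": ["Institution"]}],
  "titles": [{"title": "Poster Title"}],
  "posterContent": {
    "sections": [
      {"sectionTitle": "Abstract", "sectionContent": "..."},
      {"sectionTitle": "Methods", "sectionContent": "..."}
    ]
  }
}
"""


def summarize_poster(record: dict) -> dict:
    """Return the title, creator names, and section titles of a record."""
    missing = [key for key in REQUIRED_KEYS if key not in record]
    if missing:
        raise ValueError(f"Record is missing keys: {', '.join(missing)}")
    return {
        "title": record["titles"][0]["title"],
        "creators": [creator["name"] for creator in record["creators"]],
        "sections": [
            section["sectionTitle"]
            for section in record["posterContent"]["sections"]
        ],
    }


if __name__ == "__main__":
    print(summarize_poster(json.loads(EXAMPLE_JSON)))
```

For strict validation, checking the file against the schema referenced in "$schema" with a JSON Schema validator would be more reliable than hand-written key checks.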

System Requirements

Requirement	Specification
GPU	CUDA-capable, ≥16GB VRAM
RAM	≥32GB recommended
Python	3.10+
OS	Linux, macOS, Windows (via Docker/WSL2)

API Server

The API does not accept file uploads. Instead, the frontend uploads poster files to Bunny storage and creates ExtractionJob records in the database. This service polls the database for new jobs, downloads each job's file from Bunny, runs extraction, and writes the results to PosterMetadata.
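
The polling flow described above could look roughly like the sketch below. It is not the code in job_worker.py: the Job class, the status values, and the function names are hypothetical, an in-memory list stands in for the database, and the download and extraction steps are passed in as plain callables so the example runs without Bunny or a GPU.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Job:
    """Hypothetical stand-in for an ExtractionJob record."""
    job_id: int
    file_name: str
    status: str = "pending"  # assumed status values, for illustration


def claim_next_job(jobs: list[Job]) -> Optional[Job]:
    """Mark the first pending job as processing and return it."""
    for job in jobs:
        if job.status == "pending":
            job.status = "processing"
            return job
    return None


def process_pending_jobs(
    jobs: list[Job],
    download: Callable[[str], bytes],
    extract: Callable[[bytes], dict],
    results: dict,
) -> int:
    """Run one polling pass and return the number of jobs handled."""
    handled = 0
    while (job := claim_next_job(jobs)) is not None:
        try:
            data = download(job.file_name)         # fetch file from storage
            results[job.job_id] = extract(data)    # store extracted metadata
            job.status = "completed"
        except Exception:
            # A failed job must not stop the worker from taking the next one.
            job.status = "failed"
        handled += 1
    return handled
```

A long-running worker would call process_pending_jobs in a loop with a sleep between passes, and would claim jobs inside a database transaction so that two workers cannot pick up the same job.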

# Set required environment variables
export DATABASE_URL="postgresql://user:pass@host:5432/dbname"
export BUNNY_STORAGE_ZONE="your-storage-zone"
export BUNNY_ACCESS_KEY="your-storage-zone-password"

# Start the API (starts background job worker)
python api.py

# Health check
curl http://localhost:8000/health
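
Before starting the service, it can help to confirm that the three variables above are set. The sketch below is a standalone check and an assumption about how one might do this; it is not part of api.py, and the helper name is made up for the example.

```python
import os

# Variable names taken from the export commands above.
REQUIRED_VARS = ("DATABASE_URL", "BUNNY_STORAGE_ZONE", "BUNNY_ACCESS_KEY")


def missing_settings(env=None) -> list[str]:
    """Return the required variables that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]


if __name__ == "__main__":
    missing = missing_settings()
    if missing:
        print("Missing environment variables:", ", ".join(missing))
    else:
        print("All required environment variables are set.")
```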

See API Reference for full configuration and environment variables.

Documentation

Document	Description
Installation Guide	Detailed setup instructions
Docker Setup	Docker deployment & Windows support
Architecture	Technical details & methodology
Evaluation	Validation metrics & results
API Reference	REST API documentation

Project Structure

poster2json/
├── poster_extraction.py    # Main extraction pipeline
├── api.py                  # Flask REST API
├── requirements.txt        # Python dependencies
├── Dockerfile              # Container build
├── docker-compose.yml      # Docker orchestration
├── docs/                   # Documentation
├── example_posters/        # Sample poster files
└── test_results/           # Validation outputs

Performance

Validation results: 10/10 (100%) passing on the test set.

Metric	Score	Threshold
Word Capture	0.96	≥0.75
ROUGE-L	0.89	≥0.75
Number Capture	0.93	≥0.75
Field Proportion	0.99	0.50–2.00
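
To make the pass criteria concrete, the sketch below checks a set of scores against the thresholds in the table. It is an illustration only and does not reproduce validation.py: the function names are invented, and treating both ends of the Field Proportion range as inclusive is an assumption.

```python
# Thresholds from the table above as (lower bound, upper bound or None).
THRESHOLDS = {
    "Word Capture": (0.75, None),
    "ROUGE-L": (0.75, None),
    "Number Capture": (0.75, None),
    "Field Proportion": (0.50, 2.00),  # assumed inclusive at both ends
}


def passes(metric: str, score: float) -> bool:
    """Check one score against its threshold."""
    low, high = THRESHOLDS[metric]
    return score >= low and (high is None or score <= high)


def failed_metrics(scores: dict) -> list[str]:
    """Return the names of metrics whose scores miss their threshold."""
    return [name for name, score in scores.items() if not passes(name, score)]


# Scores reported in the table above.
reported = {
    "Word Capture": 0.96,
    "ROUGE-L": 0.89,
    "Number Capture": 0.93,
    "Field Proportion": 0.99,
}

if __name__ == "__main__":
    print(failed_metrics(reported))  # prints [] because every score passes
```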

License

MIT License - see LICENSE for details.

Citation

Part of the FAIR Data Innovations Hub posters.science project.

Contributing

Contributions welcome! Please open an issue to discuss proposed changes.
