Convert scientific posters (PDF/images) into structured JSON metadata using Large Language Models.
pip install -r requirements.txt

# If installed via pip
poster2json --annotation-dir "./posters" --output-dir "./output"
# Or run directly
python poster_extraction.py --annotation-dir "./posters" --output-dir "./output"

docker compose up --build

See Docker Setup for detailed instructions, including Windows/WSL2 support.
PDF/Image → Raw Text Extraction → LLM JSON Structuring → Structured JSON
                     ↓                      ↓
                 [pdfalto]           [Llama 3.1 8B]
                 [Qwen2-VL]      Fine-tuned for posters
- PDF files → Processed via pdfalto for layout-aware text extraction
- Image files → Processed via the Qwen2-VL-7B vision-language model
- All files → Structured into JSON by Llama-3.1-8B-Poster-Extraction
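
The routing described above can be sketched as a small dispatch on file extension. This is an illustrative sketch only: the function names `choose_extractor` and `plan_pipeline` and the extension sets are assumptions, not taken from `poster_extraction.py`, and the returned strings are labels rather than calls into the real models.

```python
from pathlib import Path

# Hypothetical extension sets; the project's actual supported formats may differ.
PDF_EXTENSIONS = {".pdf"}
IMAGE_EXTENSIONS = {".png", ".jpg", ".jpeg"}


def choose_extractor(path: str) -> str:
    """Return the name of the raw-text extractor for a poster file."""
    suffix = Path(path).suffix.lower()
    if suffix in PDF_EXTENSIONS:
        return "pdfalto"
    if suffix in IMAGE_EXTENSIONS:
        return "Qwen2-VL-7B"
    raise ValueError(f"Unsupported poster format: {suffix or path}")


def plan_pipeline(path: str) -> list[str]:
    """List the two stages a file passes through, in order."""
    # Every file, PDF or image, ends at the same JSON-structuring model.
    return [choose_extractor(path), "Llama-3.1-8B-Poster-Extraction"]
```

Under these assumptions, `plan_pipeline("poster.pdf")` names pdfalto followed by the structuring model, while an image path names Qwen2-VL-7B first.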
Output conforms to the poster-json-schema:
{
"$schema": "https://posters.science/schema/v0.1/poster_schema.json",
"creators": [
{
"name": "LastName, FirstName",
"givenName": "FirstName",
"familyName": "LastName",
"affiliation": ["Institution"]
}
],
"titles": [{ "title": "Poster Title" }],
"posterContent": {
"sections": [
{ "sectionTitle": "Abstract", "sectionContent": "..." },
{ "sectionTitle": "Methods", "sectionContent": "..." }
]
},
"imageCaptions": [{ "captions": ["Figure 1.", "Description"] }],
"tableCaptions": [{ "captions": ["Table 1.", "Description"] }]
}

| Requirement | Specification |
|---|---|
| GPU | CUDA-capable, ≥16GB VRAM |
| RAM | ≥32GB recommended |
| Python | 3.10+ |
| OS | Linux, macOS, Windows (via Docker/WSL2) |
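
A quick preflight check against the table above might look like the following. It is a rough sketch that uses only the standard library, so it covers the Python and RAM rows but not the GPU row; the names `preflight` and `total_ram_gb` are made up for illustration and are not part of the project.

```python
import os
import sys

MIN_PYTHON = (3, 10)       # "3.10+" from the requirements table
RECOMMENDED_RAM_GB = 32    # "≥32GB recommended"


def total_ram_gb():
    """Return physical RAM in GB, or None where sysconf is unavailable (e.g. Windows)."""
    try:
        return os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES") / 1e9
    except (AttributeError, ValueError, OSError):
        return None


def preflight() -> dict:
    """Compare the running interpreter and host RAM with the stated requirements."""
    ram = total_ram_gb()
    return {
        "python_ok": sys.version_info >= MIN_PYTHON,
        "ram_gb": ram,
        # None means the RAM size could not be determined on this platform.
        "ram_ok": None if ram is None else ram >= RECOMMENDED_RAM_GB,
    }
```

Since the RAM figure is a recommendation rather than a hard limit, a `ram_ok` of `False` is best treated as a warning.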
The API does not accept file uploads. The frontend uploads poster files to Bunny storage and creates ExtractionJob records in the database. This service polls the database for new jobs, downloads the file from Bunny, runs extraction, and writes results to PosterMetadata.
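
One polling pass of the flow just described could be sketched as below. This is a minimal illustration, not the code in `api.py`: the `Job` dataclass, the status strings, and the four callables are hypothetical stand-ins for the database queries, the Bunny download, the extraction pipeline, and the write to PosterMetadata.

```python
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass
class Job:
    """Hypothetical stand-in for an ExtractionJob row."""
    job_id: int
    file_url: str
    status: str = "pending"  # assumed status values: pending/completed/failed


def process_pending_jobs(
    fetch_pending: Callable[[], Iterable[Job]],
    download: Callable[[str], bytes],
    extract: Callable[[bytes], dict],
    save_metadata: Callable[[int, dict], None],
) -> int:
    """Run one polling pass and return the number of jobs completed."""
    completed = 0
    for job in fetch_pending():
        try:
            data = download(job.file_url)
            save_metadata(job.job_id, extract(data))
            job.status = "completed"
            completed += 1
        except Exception:
            # One failing poster should not stop the remaining jobs.
            job.status = "failed"
    return completed
```

A long-running worker would presumably call such a function in a loop with a pause between passes; passing the I/O steps in as callables keeps the loop testable without a database or storage account.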
# Set required environment variables
export DATABASE_URL="postgresql://user:pass@host:5432/dbname"
export BUNNY_STORAGE_ZONE="your-storage-zone"
export BUNNY_ACCESS_KEY="your-storage-zone-password"
# Start the API (starts background job worker)
python api.py
# Health check
curl http://localhost:8000/health

See API Reference for full configuration and environment variables.
| Document | Description |
|---|---|
| Installation Guide | Detailed setup instructions |
| Docker Setup | Docker deployment & Windows support |
| Architecture | Technical details & methodology |
| Evaluation | Validation metrics & results |
| API Reference | REST API documentation |
poster2json/
├── poster_extraction.py # Main extraction pipeline
├── api.py # Flask REST API
├── requirements.txt # Python dependencies
├── Dockerfile # Container build
├── docker-compose.yml # Docker orchestration
├── docs/ # Documentation
├── example_posters/ # Sample poster files
└── test_results/ # Validation outputs
Validation Results: 10/10 (100%) passing on test set
| Metric | Score | Threshold |
|---|---|---|
| Word Capture | 0.96 | ≥0.75 |
| ROUGE-L | 0.89 | ≥0.75 |
| Number Capture | 0.93 | ≥0.75 |
| Field Proportion | 0.99 | 0.50–2.00 |
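
The thresholds in the table can be applied as a simple pass/fail check. The sketch below assumes a poster passes only when all four metrics fall within their thresholds, which the table suggests but does not state outright; the dictionary keys and the function name `passes_validation` are hypothetical.

```python
# Thresholds copied from the table above as (minimum, maximum);
# None means there is no upper bound.
THRESHOLDS = {
    "word_capture": (0.75, None),
    "rouge_l": (0.75, None),
    "number_capture": (0.75, None),
    "field_proportion": (0.50, 2.00),
}


def passes_validation(scores: dict[str, float]) -> bool:
    """Return True when every metric falls inside its threshold range."""
    for name, (low, high) in THRESHOLDS.items():
        value = scores[name]
        if value < low or (high is not None and value > high):
            return False
    return True
```

With the scores reported in the table (0.96, 0.89, 0.93, 0.99), this check returns `True`.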
MIT License - see LICENSE for details.
Part of the FAIR Data Innovations Hub posters.science project.
Contributions welcome! Please open an issue to discuss proposed changes.