Queue OCR tasks to speedup conversions #1358

almet · 2025-11-27T18:46:18Z

This allows OCR to run while the conversion is still in progress,
and to use all the available CPUs.

This results in a 2.4x speedup on my quite beefy machine (AMD Ryzen, 16 CPUs).

Before this change (`time`, on a 220-page PDF, with OCR):

________________________________________________________
Executed in  274.08 secs    fish           external
   usr time  271.77 secs    1.06 millis  271.77 secs
   sys time    0.79 secs    0.01 millis    0.79 secs

After this change:

________________________________________________________
Executed in   44.78 secs    fish           external
   usr time  622.67 secs  608.00 micros  622.66 secs
   sys time    2.94 secs   62.00 micros    2.94 secs
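
A minimal sketch of the idea described above, not the code in this branch: each page is handed to a worker pool as soon as the conversion produces it, so OCR overlaps with the rest of the conversion. `convert_pages`, `ocr_page` and `convert_with_queued_ocr` are hypothetical names, and the OCR step is only a placeholder.

```python
from concurrent.futures import Executor, ProcessPoolExecutor
from typing import Callable, Iterator


def convert_pages(num_pages: int) -> Iterator[tuple[int, bytes]]:
    """Hypothetical stand-in for the conversion: yields (page number, pixels)."""
    for number in range(1, num_pages + 1):
        yield number, f"pixels-of-page-{number}".encode()


def ocr_page(number: int, pixels: bytes) -> str:
    """Placeholder for the CPU-heavy OCR of one page (no real OCR happens)."""
    return f"page {number}: {len(pixels)} bytes"


def convert_with_queued_ocr(
    num_pages: int,
    executor_factory: Callable[[], Executor] = ProcessPoolExecutor,
) -> list[str]:
    """Queue one OCR task per page while the conversion is still running."""
    with executor_factory() as pool:
        # submit() returns immediately, so the loop keeps converting while
        # workers are already busy with earlier pages.
        futures = [
            pool.submit(ocr_page, number, pixels)
            for number, pixels in convert_pages(num_pages)
        ]
        # Collect in submission order so pages stay in document order.
        return [future.result() for future in futures]
```

By default `ProcessPoolExecutor()` sizes the pool from the number of CPUs on the machine. On platforms that spawn worker processes, the call would need to sit under an `if __name__ == "__main__":` guard.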

apyrgio · 2026-01-26T22:47:49Z

Hey Alexis! I'm looking into your PR, and I have a couple of comments:

Your branch does not seem to work on my machine. The error I get is the following, which I believe happens in a worker context:
```
Usage: dangerzone-cli [OPTIONS] [FILENAMES]...
Try 'dangerzone-cli --help' for help.

Error: Missing argument 'FILENAMES...'
[INFO ] Running: /usr/bin/podman kill dangerzone-doc-to-pixels-AJSZCc
[WARNING] Could not kill container 'dangerzone-doc-to-pixels-AJSZCc' within 5 seconds
```
I think we have encountered something similar before, but I need to look into it. The good thing is that the tests work, so it's probably something on my side.
I'm wondering if we should disable multithreading on Tesseract's side, with `OMP_THREAD_LIMIT=1`.
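
If OCR were run through a subprocess (an assumption; this thread does not say how the branch invokes Tesseract), the limit could be passed through the child's environment. `single_threaded_env` and `run_with_thread_limit` are hypothetical helpers:

```python
import os
import subprocess


def single_threaded_env(base_env=None):
    """Return a copy of the environment with OpenMP capped at one thread."""
    env = dict(os.environ if base_env is None else base_env)
    env["OMP_THREAD_LIMIT"] = "1"
    return env


def run_with_thread_limit(command):
    """Run `command` with OMP_THREAD_LIMIT=1 and return its stdout."""
    completed = subprocess.run(
        command,
        env=single_threaded_env(),
        capture_output=True,
        text=True,
        check=True,
    )
    return completed.stdout
```

If OCR runs in-process instead, the variable would generally have to be set before the OpenMP runtime is initialized, for example before the worker processes are started.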
Regarding the worker queue, I also want to check whether `concurrent.futures.ProcessPoolExecutor` offers a nicer interface, maybe even a map-like one.
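
For reference, `Executor.map()` in `concurrent.futures` is such a map-like interface, and it returns results in input order. A minimal sketch, in which `ocr_page` is a placeholder and `ocr_pages` a hypothetical helper rather than anything from this branch:

```python
from concurrent.futures import Executor, ProcessPoolExecutor
from typing import Callable, Iterable


def ocr_page(page_number: int) -> str:
    """Placeholder for the real per-page OCR call."""
    return f"text of page {page_number}"


def ocr_pages(
    page_numbers: Iterable[int],
    executor_factory: Callable[[], Executor] = ProcessPoolExecutor,
) -> list[str]:
    """OCR every page through the pool; map() yields results in input order."""
    with executor_factory() as pool:
        return list(pool.map(ocr_page, page_numbers))
```

With `ProcessPoolExecutor`, the mapped function has to be a picklable top-level function, and `map()` accepts a `chunksize` argument to batch small tasks.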
I was trying to understand how we can set the `tessedit_do_invert=0` config value that I mentioned in Improve Conversion Speeds #1329 (comment), just to see if it's worth the hassle. Then I realized that we have a divergence in the Tesseract data. On Linux they contain the tessconfigs / configs files you see in https://github.com/tesseract-ocr/tessdata_fast, but on Windows / macOS we strip them away. So, I'm not sure whether this causes any issues.
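
When Tesseract is driven through its command-line tool, config variables can be set with `-c NAME=VALUE`. The sketch below only builds such a command. It assumes the CLI is used at all, which may not match how Dangerzone runs OCR, and `build_tesseract_command` and the file names are hypothetical.

```python
def build_tesseract_command(image_path, output_base, lang="eng", config_vars=None):
    """Build a `tesseract` CLI invocation with optional `-c NAME=VALUE` settings."""
    command = ["tesseract", image_path, output_base, "-l", lang]
    for name, value in (config_vars or {}).items():
        command += ["-c", f"{name}={value}"]
    return command


# Hypothetical input and output names, for illustration only.
command = build_tesseract_command(
    "page-1.png", "page-1", config_vars={"tessedit_do_invert": 0}
)
```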

Fixes #1329

Having this parsing done in the `__init__.py` script leads to issues down the line when one wants to import parts of the code without having to parse the CLI arguments.

Moving the parsing out of `__init__.py` enables us to remove the `dev_scripts/dangerzone*` scripts, which were forcing the project into the path and enabling dev mode:

- The dev mode is now handled by the use of a `DANGERZONE_DEV=1` environment variable; and
- The path changes were actually unnecessary when using `poetry run`, which already does that for us.
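
A minimal sketch of how such an environment check could look. `is_dev_mode` is a hypothetical helper, and treating only the exact value `1` as enabled is an assumption, not necessarily what the code in this branch does.

```python
import os
from typing import Mapping, Optional


def is_dev_mode(environ: Optional[Mapping[str, str]] = None) -> bool:
    """Return True when DANGERZONE_DEV is set to "1" (assumed convention)."""
    env = os.environ if environ is None else environ
    return env.get("DANGERZONE_DEV") == "1"
```

Because the check reads the environment only when it is called, importing the package has no side effects, which is the property the commit message is after.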

github-project-automation bot added this to Dangerzone ✨ Nov 27, 2025

almet changed the title ~~Queue OCR tasks~~ Queue OCR tasks to speedup conversions Nov 27, 2025

almet force-pushed the queue-ocr branch 4 times, most recently from 5e2775d to 2efb30e Compare January 20, 2026 16:31

almet marked this pull request as ready for review January 20, 2026 16:36

almet requested a review from apyrgio January 21, 2026 09:01

almet added 3 commits February 5, 2026 17:17

Remove unused MINIMUM_DOCKER_DESKTOP version

0eb71d4

almet force-pushed the queue-ocr branch from 2efb30e to 046ed0b Compare February 5, 2026 16:53

almet added 6 commits February 9, 2026 09:08

FIXUP: update docs and CI about dev_script/dangerzone-* commands

ac51f15

FIXUP: update the entrypoint used by dangerzone-machine

380e6f5

FIXUP pass DANGERZONE_DEV=1

d1eeaeb

FIXUP: put back the name in tool.poetry

218a0dd

FIx lint

28cabd3

fixup: upate qa and windows scripts

ca448bd
