almet commented Nov 27, 2025

This allows OCR to run at the same time as the conversion,
using all the available CPUs.

This results in a 2.4x speedup on my fairly beefy machine (AMD Ryzen, 16 CPUs).

Fixes #1329

Before this change (time, on a 220-page PDF, with OCR):

________________________________________________________
Executed in  274.08 secs    fish           external
   usr time  271.77 secs    1.06 millis  271.77 secs
   sys time    0.79 secs    0.01 millis    0.79 secs

After this change:

________________________________________________________
Executed in   44.78 secs    fish           external
   usr time  622.67 secs  608.00 micros  622.66 secs
   sys time    2.94 secs   62.00 micros    2.94 secs

almet changed the title from "Queue OCR tasks" to "Queue OCR tasks to speedup conversions" Nov 27, 2025
almet force-pushed the queue-ocr branch 4 times, most recently from 5e2775d to 2efb30e January 20, 2026 16:31
almet marked this pull request as ready for review January 20, 2026 16:36
almet requested a review from apyrgio January 21, 2026 09:01
apyrgio commented Jan 26, 2026

Hey Alexis! I'm looking into your PR, and I have a couple of comments:

  1. Your branch does not seem to work on my machine. The error I get is the following, which I believe happens in a worker context:

    Usage: dangerzone-cli [OPTIONS] [FILENAMES]...
    Try 'dangerzone-cli --help' for help.
    
    Error: Missing argument 'FILENAMES...'
    [INFO ] Running: /usr/bin/podman kill dangerzone-doc-to-pixels-AJSZCc
    [WARNING] Could not kill container 'dangerzone-doc-to-pixels-AJSZCc' within 5 seconds
    

    I think we have encountered something similar before, I need to look into it though. The good thing is that the tests work, so probably it's something on my side.

  2. I'm wondering if we should disable multithreading on Tesseract's side, with OMP_THREAD_LIMIT=1.

  3. Regarding the worker queue, I also want to check if concurrent.futures.ProcessPoolExecutor offers a nicer interface, maybe even a map-like one.

  4. I was trying to understand how we can set the tessedit_do_invert=0 config value that I mentioned in #1329 (comment), just to see if it's worth the hassle. Then I realized that our Tesseract data diverges across platforms. On Linux it contains the tessconfigs / configs files you see in https://github.com/tesseract-ocr/tessdata_fast, but on Windows / macOS we strip them away. I'm not sure whether this causes any issues.
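
To make points 2 and 3 a bit more concrete, here is a rough sketch of what I have in mind. It is not taken from the branch: `ocr_page()` is a hypothetical stand-in for the real per-page OCR call, and `limit_tesseract_threads()` and `ocr_document()` are names I made up for illustration. It also assumes that setting OMP_THREAD_LIMIT in the parent, before the pool is created, is early enough for the workers to pick it up, which would need checking against how Tesseract is actually loaded.

```python
import os
from concurrent.futures import ProcessPoolExecutor
from typing import List, Optional, Tuple


def limit_tesseract_threads() -> None:
    # Assumption: worker processes inherit this variable from the parent,
    # so each one runs Tesseract's OpenMP code on a single thread.
    os.environ["OMP_THREAD_LIMIT"] = "1"


def ocr_page(page_number: int) -> Tuple[int, str]:
    # Hypothetical placeholder: the real worker would OCR the page's pixels.
    return page_number, f"text of page {page_number}"


def ocr_document(num_pages: int, max_workers: Optional[int] = None) -> List[Tuple[int, str]]:
    limit_tesseract_threads()
    # max_workers=None lets the executor choose a default based on the CPU count.
    with ProcessPoolExecutor(max_workers=max_workers) as pool:
        # map() yields results in input order, even if pages finish out of order.
        return list(pool.map(ocr_page, range(1, num_pages + 1)))


if __name__ == "__main__":
    for page, text in ocr_document(4, max_workers=2):
        print(page, text)
```

The appeal of `map()` here is that the caller gets pages back in order without managing a queue by hand. Whether it fits depends on whether OCR has to start before all pages have been converted, which a plain `map()` over a fixed range does not model.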

almet added 3 commits February 5, 2026 17:17
Having this parsing done in the `__init__.py` script leads to issues
down the line when one wants to import parts of the code without having
to parse the CLI arguments.

Doing so enables us to remove the `dev_scripts/dangerzone*` scripts,
which were forcing the project into the path and enabling dev mode.

- Dev mode is now handled by a `DANGERZONE_DEV=1` environment
  variable; and
- The path changes were actually unnecessary when using `poetry run`,
  which already does that for us.
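
A minimal sketch of how such an environment check could look. The helper name `is_dev_mode()` is hypothetical, and treating only the exact value "1" as enabling dev mode is an assumption, not something this commit message states:

```python
import os
from typing import Mapping, Optional


def is_dev_mode(environ: Optional[Mapping[str, str]] = None) -> bool:
    # Hypothetical helper: read the flag lazily, at call time, so that
    # importing the package has no side effects on its own.
    env = os.environ if environ is None else environ
    # Assumption: only the exact value "1" turns dev mode on.
    return env.get("DANGERZONE_DEV") == "1"


if __name__ == "__main__":
    print("dev mode:", is_dev_mode())
```

Reading the variable inside a function, rather than at import time, keeps the package importable from other code without any CLI or environment handling being triggered.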