Queue OCR tasks to speedup conversions #1358
Hey Alexis! I'm looking into your PR, and I have a couple of comments:
This allows the OCR to run while the conversion is in progress, using all the available CPU cores.

This results in a 2.4x speedup on my quite beefy machine (AMD Ryzen, 16 CPUs).

Fixes #1329

Before this change (`time`, on a 220-page PDF, with OCR):

```
________________________________________________________
Executed in   44.78 secs    fish           external
   usr time  622.67 secs  608.00 micros  622.66 secs
   sys time    2.94 secs   62.00 micros    2.94 secs
```

After this change:

```
________________________________________________________
Executed in  274.08 secs    fish           external
   usr time  271.77 secs    1.06 millis  271.77 secs
   sys time    0.79 secs    0.01 millis    0.79 secs
```
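To illustrate the idea of queueing OCR tasks while the conversion keeps going, here is a minimal sketch. It is not the code from this PR: `render_page`, `ocr_page`, and `convert` are hypothetical stand-ins, and the choice of a thread pool assumes the real OCR work happens outside the Python interpreter (for instance in a subprocess or a native library).

```python
import os
from concurrent.futures import ThreadPoolExecutor


def render_page(page_number: int) -> bytes:
    """Hypothetical stand-in for converting one page to pixels."""
    return f"pixels-{page_number}".encode()


def ocr_page(pixels: bytes) -> str:
    """Hypothetical stand-in for a CPU-heavy OCR call on one page."""
    return pixels.decode().replace("pixels", "text")


def convert(num_pages: int) -> list[str]:
    """Render pages one by one, queueing each page for OCR as soon as it is ready."""
    # One worker per CPU, so queued OCR tasks can use every available core.
    with ThreadPoolExecutor(max_workers=os.cpu_count()) as pool:
        futures = []
        for page_number in range(1, num_pages + 1):
            pixels = render_page(page_number)
            # submit() returns immediately, so the loop moves on to the
            # next page while earlier pages are still being processed.
            futures.append(pool.submit(ocr_page, pixels))
        # Collect results in page order, regardless of completion order.
        return [future.result() for future in futures]
```

Collecting the futures in submission order keeps the output pages in document order even if some pages finish OCR earlier than others.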
Having the CLI argument parsing done in the `__init__.py` script leads to issues down the line when one wants to import parts of the code without having to parse the CLI arguments.

Moving it out also enables us to remove the `dev_scripts/dangerzone*` scripts, which were forcing the project into the path and enabling dev mode:

- The dev mode is now handled by the use of a `DANGERZONE_DEV=1` environment variable; and
- The path changes were actually unnecessary when using `poetry run`, which already does that for us.
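As a rough sketch of the pattern described above (not the project's actual code), argument parsing can live in a function that only runs when explicitly called, and dev mode can be read from the environment. Apart from the `DANGERZONE_DEV` variable, every name here (`is_dev_mode`, `parse_args`, `main`, the `example-cli` program name) is made up for illustration.

```python
import argparse
import os


def is_dev_mode() -> bool:
    """Dev mode is on only when DANGERZONE_DEV is set to "1"."""
    return os.environ.get("DANGERZONE_DEV") == "1"


def parse_args(argv=None) -> argparse.Namespace:
    """Parse CLI arguments on demand, not as a side effect of importing."""
    parser = argparse.ArgumentParser(prog="example-cli")
    parser.add_argument("filename", nargs="?", help="document to convert")
    return parser.parse_args(argv)


def main(argv=None) -> int:
    """Hypothetical entry point: nothing is parsed until this is called."""
    args = parse_args(argv)
    mode = "dev" if is_dev_mode() else "release"
    print(f"mode={mode} file={args.filename}")
    return 0
```

With this layout, importing the module has no side effects, and a developer would opt in with something like `DANGERZONE_DEV=1 poetry run <command>`.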