Strip + prune the venv to slim the runtime image (~30MB) #58

Open: jghoman wants to merge 1 commit into main from slim-image

@jghoman (Collaborator) commented May 14, 2026

Summary

Adds one RUN block in the Dockerfile's builder stage to (a) strip every .so in the venv with strip --strip-unneeded, which removes debug information and any other symbols not needed for relocation, and (b) delete bundled tests/, test/, __pycache__/, *.pyi, and *.pyc artifacts that have no runtime consumer. binutils is installed only in the builder stage and never lands in the final image.
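For illustration, the two steps could be expressed as the shell helpers below. This is a sketch under assumptions, not the PR's actual RUN block: the function names `strip_venv` and `prune_venv` are hypothetical, the `/app/.venv` default is taken from the commit message, and the exact `find` patterns are a guess at what such a block would match.

```shell
#!/bin/sh
# Hypothetical helpers; the real RUN block in the Dockerfile may differ.

# Strip symbols not needed for relocation from every shared object under $1.
# Requires binutils' `strip` on PATH (installed only in the builder stage).
strip_venv() {
    venv="${1:-/app/.venv}"
    # `|| true` keeps the build going if strip rejects a file it cannot parse.
    find "$venv" -type f \( -name '*.so' -o -name '*.so.*' \) \
        -exec strip --strip-unneeded {} + || true
}

# Delete install-time artifacts that have no runtime consumer.
prune_venv() {
    venv="${1:-/app/.venv}"
    # -prune stops find from descending into a directory it is about to delete.
    find "$venv" -type d \( -name tests -o -name test -o -name __pycache__ \) \
        -prune -exec rm -rf {} +
    find "$venv" -type f \( -name '*.pyi' -o -name '*.pyc' \) -exec rm -f {} +
}
```

In a builder stage these would be called as `strip_venv /app/.venv && prune_venv /app/.venv` after the dependencies are installed and before the venv is copied into the final stage.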

Numbers

Image                                             Size
ghcr.io/posthog/millpond:latest (current main)    459 MB
This PR (millpond:main-stripped locally)          429 MB
Δ                                                 −30 MB

The biggest contributors stripped are libarrow.so, libarrow_acero.so, libparquet.so, and librdkafka.so — all ship with full debug tables that aren't needed at runtime.
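One way to see which libraries dominate is to rank the shared objects in the venv by size, before and after stripping. The helper below is a minimal sketch; `largest_shared_objects` is a hypothetical name and is not part of this PR.

```shell
#!/bin/sh
# Hypothetical helper: print the N largest shared objects under a directory,
# one per line, as "<bytes> <path>", largest first.
largest_shared_objects() {
    dir="$1"
    count="${2:-10}"
    find "$dir" -type f \( -name '*.so' -o -name '*.so.*' \) |
    while IFS= read -r f; do
        # wc -c reads the byte count; tr removes padding some platforms add.
        printf '%s %s\n' "$(wc -c < "$f" | tr -d ' ')" "$f"
    done | sort -rn | head -n "$count"
}
```

Running `largest_shared_objects /app/.venv 5` in the builder stage on either side of the strip step would show the per-library savings.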

Sanity check

$ docker run --rm --entrypoint python millpond:main-stripped -c \
    "import duckdb, pyarrow, confluent_kafka, prometheus_client; from millpond import ducklake, schema; print('OK')"
OK

All imports succeed in the stripped image. The strip removes only symbols that aren't needed to load and link the libraries at runtime, and the prune targets install-time artifacts.

Why this is safe

  • Strip debug symbols: standard image-slimming practice. It removes symbol names from native backtraces (for example, when inspecting a coredump in a debugger) but doesn't affect runtime behavior or Python-level error reporting. We don't ship debug coredumps anyway.
  • Prune tests/, __pycache__/, *.pyi, *.pyc: bundled package test directories aren't imported by the application; type stubs (.pyi) are pure annotations consumed by type checkers, not by the runtime; .pyc files are regenerated by Python at first import if needed (or the module is compiled in memory when the directory isn't writable).
  • The binutils install is in the builder stage; the final image's apt state is unchanged.

What this PR does not do

Other slimming levers we measured but skipped:

  • Splitting the msk-iam extra out of the default image → another −20MB, but it is an operational change (it would need two published variants and a real decision about which non-MSK deploys exist).
  • Distroless base → declined: Google's python3-debian12 is Python 3.11; we require 3.12.

The strip + prune is the zero-risk free win. Bigger cuts can come if there's a real driver.

Test plan

  • docker build . succeeds (the only file touched is the Dockerfile)
  • Image size confirmed via docker images
  • python -c "import …" smoke test in the built image
  • Verify CI image build still publishes a working image
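
The size check in the plan above can be scripted rather than read off `docker images` by eye. The sketch below assumes both image tags from the Numbers table exist locally; `image_size_bytes` and `size_delta_mb` are hypothetical helper names.

```shell
#!/bin/sh
# Size of a local image in bytes, as reported by the Docker CLI.
image_size_bytes() {
    docker image inspect --format '{{.Size}}' "$1"
}

# Difference (after - before) in decimal megabytes, the unit `docker images` prints.
size_delta_mb() {
    before="$1"
    after="$2"
    echo $(( (after - before) / 1000000 ))
}

# Example (requires both images to be present locally):
#   size_delta_mb "$(image_size_bytes ghcr.io/posthog/millpond:latest)" \
#                 "$(image_size_bytes millpond:main-stripped)"
```

A negative result means the second image is smaller; the integer division truncates any sub-megabyte remainder.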

Adds one RUN block in the builder stage that:

1. Installs binutils (builder-only; never lands in the final image).
2. Strips debug symbols from every .so in /app/.venv via `strip
   --strip-unneeded`. Most of the heft is PyArrow's libarrow.so and
   librdkafka.so; both ship with full debug tables.
3. Deletes bundled tests/, test/, __pycache__/, *.pyi, *.pyc inside
   the venv. These are install-time artifacts; the runtime doesn't
   need them.
4. Cleans /root/.cache, /tmp, and the apt lists.

Measured on this branch's Dockerfile (so the duckdb-cli + extensions
+ tools/ paths are unchanged):

  Before: ghcr.io/posthog/millpond:latest  459 MB
  After:  millpond:main-stripped           429 MB
  Delta:                                   -30 MB

Sanity-tested: `python -c "import duckdb, pyarrow, confluent_kafka,
prometheus_client; from millpond import ducklake, schema"` succeeds in
the stripped image.

No behavior change; the strip is on debug symbols only and the prune
targets install-time artifacts that have no runtime consumer.