Skip to content

perf(file): stream File.save() to avoid OOM on large files#1754

Merged
shcheklein merged 5 commits into
mainfrom
fix/streaming-upload
May 8, 2026
Merged

perf(file): stream File.save() to avoid OOM on large files#1754
shcheklein merged 5 commits into
mainfrom
fix/streaming-upload

Conversation

@shcheklein

Copy link
Copy Markdown
Contributor

Problem

File.save() materializes the entire source file into memory via self.read_bytes() before handing the resulting bytes to Client.upload(), which calls fs.pipe_file(path, data). For a multi-GB artifact this peaks at file_size of RAM and OOM-kills constrained workers — we hit this on a 7 GB k8s worker trying to save a ~6 GB X_train.npy artifact through DataChain Studio.

fs.pipe_file is bytes-only (single f.write(value) under the hood), and fs.open(..., 'wb').write(huge_bytes) would also buffer the whole payload locally before flushing — AbstractBufferedFile only flushes a multipart part once self.blocksize is reached, so a single big .write() accumulates everything in self.buffer first.

Fix

  • Client.upload() now accepts a bytes-like object or a binary readable stream:
    • bytes-like → existing fast pipe_file path, unchanged.
    • stream → copied into fs.open(path, 'wb') via shutil.copyfileobj. The destination AbstractBufferedFile flushes a multipart part every blocksize bytes (e.g. 5 MiB on s3fs), so peak RAM is bounded by the backend's part size rather than the file size.
  • File.save() opens the source file and streams it directly into Client.upload(), eliminating the intermediate full-file bytes blob.
  • HTTPClient.upload signature widened to match (still NotImplementedError).

Verification

tracemalloc smoke test, saving a 210 MB local file via File.save():

version peak alloc
before (read_bytes() + pipe_file) ~210 MB (= file size)
after (streamed via shutil.copyfileobj) ~0.2 MB
import os, tempfile, tracemalloc
from datachain import Session
from datachain.lib.file import File

SIZE = 200 * 1024 * 1024
with tempfile.TemporaryDirectory() as a, tempfile.TemporaryDirectory() as b:
    src = os.path.join(a, 'big.bin'); open(src, 'wb').write(b'\x00' * SIZE)
    sess = Session.get()
    f = File.at(f'file://{src}'); f._set_stream(sess.catalog)
    tracemalloc.start(); f.save(os.path.join(b, 'out.bin'))
    _, peak = tracemalloc.get_traced_memory(); tracemalloc.stop()
    print(peak / 1e6, 'MB')   # ~0.2

Tests

  • New unit test test_upload_accepts_binary_stream in tests/unit/test_client_local.py covers the stream path.
  • Existing tests/unit/test_client_local.py (69 passed, 6 Windows-only skipped) continues to pass — the bytes path is unchanged.

Compatibility

  • Public signature widened (bytesbytes | bytearray | memoryview | BinaryIO); existing callers passing bytes are unaffected.
  • All in-tree call sites continue to pass bytes and exercise the pipe_file fast path.

shcheklein and others added 2 commits May 7, 2026 16:38
… on large files

File.save() loaded the full source file into memory via read_bytes() before
passing the resulting bytes to Client.upload(), which then called
fs.pipe_file(path, data). For a multi-GB artifact this peaks at file_size in
RAM and OOMs constrained workers (observed at ~6 GB X_train.npy on a 7 GB
container).

This change:
- Client.upload() now also accepts a binary file-like object. Bytes-like
  inputs keep the existing fast pipe_file path; readable streams are copied
  into the destination via fs.open(..., 'wb') using shutil.copyfileobj. The
  destination AbstractBufferedFile flushes a multipart part every blocksize
  bytes (e.g. 5 MiB on s3fs), so peak RAM is bounded by the backend's part
  size rather than the file size.
- File.save() opens the source File and passes the read stream directly to
  Client.upload(), eliminating the intermediate full-file bytes blob.

Verified with tracemalloc: saving a 210 MB file now peaks at ~0.2 MB
allocations vs the previous ~210 MB.
@cloudflare-workers-and-pages

cloudflare-workers-and-pages Bot commented May 7, 2026

Copy link
Copy Markdown

Deploying datachain with  Cloudflare Pages  Cloudflare Pages

Latest commit: dde35b5
Status: ✅  Deploy successful!
Preview URL: https://08d0ced2.datachain-2g6.pages.dev
Branch Preview URL: https://fix-streaming-upload.datachain-2g6.pages.dev

View logs

@shcheklein shcheklein self-assigned this May 7, 2026
@shcheklein shcheklein requested review from a team and Copilot May 7, 2026 23:39

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR addresses out-of-memory failures when saving large artifacts by changing uploads from a “materialize whole file into bytes” approach to a streaming approach that keeps peak memory bounded.

Changes:

  • Update Client.upload() (fsspec-backed clients) to accept either bytes-like data or a binary readable stream, streaming via shutil.copyfileobj for the latter.
  • Update File.save() to stream the source file into Client.upload() instead of calling read_bytes().
  • Add a unit test ensuring FileClient.upload() accepts a binary stream.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 4 comments.

File Description
tests/unit/test_client_local.py Adds coverage for upload() accepting a binary readable stream.
src/datachain/lib/file.py Switches File.save() to stream from open() into client.upload().
src/datachain/client/http.py Widens upload() signature (still raises read-only NotImplementedError).
src/datachain/client/fsspec.py Implements stream-capable upload() path and swaps copy2 import usage.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread src/datachain/lib/file.py Outdated
Comment thread src/datachain/client/fsspec.py Outdated
Comment thread src/datachain/client/fsspec.py
Comment thread src/datachain/client/http.py Outdated
…p annotation

- File.save: pass mode='rb' so subclasses defaulting to text (TextFile) still
  stream bytes into the destination.
- Client.upload: validate non-bytes input has .read() and raise TypeError with
  a clear message; use a larger copy buffer (destination.blocksize or 8 MiB)
  rather than shutil's 16 KiB default to reduce per-chunk overhead on
  multi-GB uploads.
- HTTPClient.upload: restore explicit type annotation matching the base
  class signature.

@dmpetrov dmpetrov left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LG

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 4 out of 4 changed files in this pull request and generated 1 comment.

Comment thread src/datachain/client/fsspec.py Outdated
@codecov

codecov Bot commented May 8, 2026

Copy link
Copy Markdown

Codecov Report

✅ All modified and coverable lines are covered by tests.

📢 Thoughts on this report? Let us know!

@dreadatour dreadatour left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔥

The probe-and-raise was overkill for an internal API. shutil.copyfileobj /
dst.write will surface the same problem with a slightly less polished
traceback if someone passes a text stream.

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 4 out of 4 changed files in this pull request and generated 1 comment.

Comment thread src/datachain/client/http.py
BinaryIO was imported but only referenced inside a quoted annotation in
http.py, which Ruff/Pyflakes treat as unused. Unquote the annotation in
both http.py and fsspec.py for consistency now that PEP 604 unions are
runtime-evaluatable on supported Python versions.

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.

Comment thread src/datachain/client/fsspec.py
Comment thread src/datachain/client/fsspec.py
@shcheklein shcheklein merged commit 4a23aef into main May 8, 2026
37 checks passed
@shcheklein shcheklein deleted the fix/streaming-upload branch May 8, 2026 02:39
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants