perf(file): stream File.save() to avoid OOM on large files by shcheklein · Pull Request #1754 · datachain-ai/datachain

shcheklein · 2026-05-07T23:38:58Z

Problem

File.save() materializes the entire source file into memory via self.read_bytes() before handing the resulting bytes to Client.upload(), which calls fs.pipe_file(path, data). For a multi-GB artifact this peaks at file_size of RAM and OOM-kills constrained workers — we hit this on a 7 GB k8s worker trying to save a ~6 GB X_train.npy artifact through DataChain Studio.

fs.pipe_file is bytes-only (single f.write(value) under the hood), and fs.open(..., 'wb').write(huge_bytes) would also buffer the whole payload locally before flushing — AbstractBufferedFile only flushes a multipart part once self.blocksize is reached, so a single big .write() accumulates everything in self.buffer first.

Fix

Client.upload() now accepts a bytes-like object or a binary readable stream:
- bytes-like → existing fast pipe_file path, unchanged.
- stream → copied into fs.open(path, 'wb') via shutil.copyfileobj. The destination AbstractBufferedFile flushes a multipart part every blocksize bytes (e.g. 5 MiB on s3fs), so peak RAM is bounded by the backend's part size rather than the file size.
File.save() opens the source file and streams it directly into Client.upload(), eliminating the intermediate full-file bytes blob.
HTTPClient.upload signature widened to match (still NotImplementedError).

Verification

tracemalloc smoke test, saving a 210 MB local file via File.save():

version	peak alloc
before (`read_bytes()` + `pipe_file`)	~210 MB (= file size)
after (streamed via `shutil.copyfileobj`)	~0.2 MB

import os, tempfile, tracemalloc
from datachain import Session
from datachain.lib.file import File

SIZE = 200 * 1024 * 1024
with tempfile.TemporaryDirectory() as a, tempfile.TemporaryDirectory() as b:
    src = os.path.join(a, 'big.bin'); open(src, 'wb').write(b'\x00' * SIZE)
    sess = Session.get()
    f = File.at(f'file://{src}'); f._set_stream(sess.catalog)
    tracemalloc.start(); f.save(os.path.join(b, 'out.bin'))
    _, peak = tracemalloc.get_traced_memory(); tracemalloc.stop()
    print(peak / 1e6, 'MB')   # ~0.2

Tests

New unit test test_upload_accepts_binary_stream in tests/unit/test_client_local.py covers the stream path.
Existing tests/unit/test_client_local.py (69 passed, 6 Windows-only skipped) continues to pass — the bytes path is unchanged.

Compatibility

Public signature widened (bytes → bytes | bytearray | memoryview | BinaryIO); existing callers passing bytes are unaffected.
All in-tree call sites continue to pass bytes and exercise the pipe_file fast path.

… on large files File.save() loaded the full source file into memory via read_bytes() before passing the resulting bytes to Client.upload(), which then called fs.pipe_file(path, data). For a multi-GB artifact this peaks at file_size in RAM and OOMs constrained workers (observed at ~6 GB X_train.npy on a 7 GB container). This change: - Client.upload() now also accepts a binary file-like object. Bytes-like inputs keep the existing fast pipe_file path; readable streams are copied into the destination via fs.open(..., 'wb') using shutil.copyfileobj. The destination AbstractBufferedFile flushes a multipart part every blocksize bytes (e.g. 5 MiB on s3fs), so peak RAM is bounded by the backend's part size rather than the file size. - File.save() opens the source File and passes the read stream directly to Client.upload(), eliminating the intermediate full-file bytes blob. Verified with tracemalloc: saving a 210 MB file now peaks at ~0.2 MB allocations vs the previous ~210 MB.

for more information, see https://pre-commit.ci

cloudflare-workers-and-pages · 2026-05-07T23:39:13Z

Deploying datachain with Cloudflare Pages

Latest commit:	`dde35b5`
Status:	✅ Deploy successful!
Preview URL:	https://08d0ced2.datachain-2g6.pages.dev
Branch Preview URL:	https://fix-streaming-upload.datachain-2g6.pages.dev

View logs

Copilot

Pull request overview

This PR addresses out-of-memory failures when saving large artifacts by changing uploads from a “materialize whole file into bytes” approach to a streaming approach that keeps peak memory bounded.

Changes:

Update Client.upload() (fsspec-backed clients) to accept either bytes-like data or a binary readable stream, streaming via shutil.copyfileobj for the latter.
Update File.save() to stream the source file into Client.upload() instead of calling read_bytes().
Add a unit test ensuring FileClient.upload() accepts a binary stream.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 4 comments.

File	Description
`tests/unit/test_client_local.py`	Adds coverage for `upload()` accepting a binary readable stream.
`src/datachain/lib/file.py`	Switches `File.save()` to stream from `open()` into `client.upload()`.
`src/datachain/client/http.py`	Widens `upload()` signature (still raises read-only `NotImplementedError`).
`src/datachain/client/fsspec.py`	Implements stream-capable `upload()` path and swaps `copy2` import usage.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

…p annotation - File.save: pass mode='rb' so subclasses defaulting to text (TextFile) still stream bytes into the destination. - Client.upload: validate non-bytes input has .read() and raise TypeError with a clear message; use a larger copy buffer (destination.blocksize or 8 MiB) rather than shutil's 16 KiB default to reduce per-chunk overhead on multi-GB uploads. - HTTPClient.upload: restore explicit type annotation matching the base class signature.

dmpetrov

LG

Copilot

Pull request overview

Copilot reviewed 4 out of 4 changed files in this pull request and generated 1 comment.

codecov · 2026-05-08T01:15:21Z

Codecov Report

✅ All modified and coverable lines are covered by tests.

📢 Thoughts on this report? Let us know!

dreadatour

🔥

The probe-and-raise was overkill for an internal API. shutil.copyfileobj / dst.write will surface the same problem with a slightly less polished traceback if someone passes a text stream.

Copilot

Pull request overview

Copilot reviewed 4 out of 4 changed files in this pull request and generated 1 comment.

BinaryIO was imported but only referenced inside a quoted annotation in http.py, which Ruff/Pyflakes treat as unused. Unquote the annotation in both http.py and fsspec.py for consistency now that PEP 604 unions are runtime-evaluatable on supported Python versions.

Copilot

Pull request overview

Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.

shcheklein and others added 2 commits May 7, 2026 16:38

[pre-commit.ci] auto fixes from pre-commit.com hooks

b0ed0be

for more information, see https://pre-commit.ci

shcheklein self-assigned this May 7, 2026

shcheklein requested review from a team and Copilot May 7, 2026 23:39

shcheklein added performance priority-p1 labels May 7, 2026

Copilot AI reviewed May 7, 2026

View reviewed changes

Comment thread src/datachain/lib/file.py Outdated

Comment thread src/datachain/client/fsspec.py Outdated

Comment thread src/datachain/client/fsspec.py

Comment thread src/datachain/client/http.py Outdated

shcheklein requested a review from Copilot May 8, 2026 01:10

Copilot started reviewing on behalf of shcheklein May 8, 2026 01:10 View session

dmpetrov approved these changes May 8, 2026

View reviewed changes

Copilot AI reviewed May 8, 2026

View reviewed changes

Comment thread src/datachain/client/fsspec.py Outdated

dreadatour approved these changes May 8, 2026

View reviewed changes

simplify: drop stream type-validation in Client.upload

1d3061b

The probe-and-raise was overkill for an internal API. shutil.copyfileobj / dst.write will surface the same problem with a slightly less polished traceback if someone passes a text stream.

shcheklein requested a review from Copilot May 8, 2026 01:19

Copilot started reviewing on behalf of shcheklein May 8, 2026 01:20 View session

Copilot AI reviewed May 8, 2026

View reviewed changes

Comment thread src/datachain/client/http.py

shcheklein requested a review from Copilot May 8, 2026 01:28

Copilot started reviewing on behalf of shcheklein May 8, 2026 01:28 View session

Copilot AI reviewed May 8, 2026

View reviewed changes

Comment thread src/datachain/client/fsspec.py

Comment thread src/datachain/client/fsspec.py

shcheklein merged commit 4a23aef into main May 8, 2026
37 checks passed

shcheklein deleted the fix/streaming-upload branch May 8, 2026 02:39

Uh oh!

Conversation

shcheklein commented May 7, 2026

Problem

Fix

Verification

Tests

Compatibility

Uh oh!

cloudflare-workers-and-pages Bot commented May 7, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Deploying datachain with Cloudflare Pages

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

dmpetrov left a comment

Choose a reason for hiding this comment

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Uh oh!

codecov Bot commented May 8, 2026

Codecov Report

Uh oh!

dreadatour left a comment

Choose a reason for hiding this comment

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

cloudflare-workers-and-pages Bot commented May 7, 2026 •

edited

Loading