perf(file): stream File.save() to avoid OOM on large files#1754
Merged
Conversation
… on large files File.save() loaded the full source file into memory via read_bytes() before passing the resulting bytes to Client.upload(), which then called fs.pipe_file(path, data). For a multi-GB artifact this peaks at file_size in RAM and OOMs constrained workers (observed at ~6 GB X_train.npy on a 7 GB container). This change: - Client.upload() now also accepts a binary file-like object. Bytes-like inputs keep the existing fast pipe_file path; readable streams are copied into the destination via fs.open(..., 'wb') using shutil.copyfileobj. The destination AbstractBufferedFile flushes a multipart part every blocksize bytes (e.g. 5 MiB on s3fs), so peak RAM is bounded by the backend's part size rather than the file size. - File.save() opens the source File and passes the read stream directly to Client.upload(), eliminating the intermediate full-file bytes blob. Verified with tracemalloc: saving a 210 MB file now peaks at ~0.2 MB allocations vs the previous ~210 MB.
for more information, see https://pre-commit.ci
Deploying datachain with
|
| Latest commit: |
dde35b5
|
| Status: | ✅ Deploy successful! |
| Preview URL: | https://08d0ced2.datachain-2g6.pages.dev |
| Branch Preview URL: | https://fix-streaming-upload.datachain-2g6.pages.dev |
Contributor
There was a problem hiding this comment.
Pull request overview
This PR addresses out-of-memory failures when saving large artifacts by changing uploads from a “materialize whole file into bytes” approach to a streaming approach that keeps peak memory bounded.
Changes:
- Update
Client.upload()(fsspec-backed clients) to accept either bytes-like data or a binary readable stream, streaming viashutil.copyfileobjfor the latter. - Update
File.save()to stream the source file intoClient.upload()instead of callingread_bytes(). - Add a unit test ensuring
FileClient.upload()accepts a binary stream.
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
tests/unit/test_client_local.py |
Adds coverage for upload() accepting a binary readable stream. |
src/datachain/lib/file.py |
Switches File.save() to stream from open() into client.upload(). |
src/datachain/client/http.py |
Widens upload() signature (still raises read-only NotImplementedError). |
src/datachain/client/fsspec.py |
Implements stream-capable upload() path and swaps copy2 import usage. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
…p annotation - File.save: pass mode='rb' so subclasses defaulting to text (TextFile) still stream bytes into the destination. - Client.upload: validate non-bytes input has .read() and raise TypeError with a clear message; use a larger copy buffer (destination.blocksize or 8 MiB) rather than shutil's 16 KiB default to reduce per-chunk overhead on multi-GB uploads. - HTTPClient.upload: restore explicit type annotation matching the base class signature.
Codecov Report✅ All modified and coverable lines are covered by tests. 📢 Thoughts on this report? Let us know! |
The probe-and-raise was overkill for an internal API. shutil.copyfileobj / dst.write will surface the same problem with a slightly less polished traceback if someone passes a text stream.
BinaryIO was imported but only referenced inside a quoted annotation in http.py, which Ruff/Pyflakes treat as unused. Unquote the annotation in both http.py and fsspec.py for consistency now that PEP 604 unions are runtime-evaluatable on supported Python versions.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Problem
File.save()materializes the entire source file into memory viaself.read_bytes()before handing the resultingbytestoClient.upload(), which callsfs.pipe_file(path, data). For a multi-GB artifact this peaks atfile_sizeof RAM and OOM-kills constrained workers — we hit this on a 7 GB k8s worker trying to save a ~6 GBX_train.npyartifact through DataChain Studio.fs.pipe_fileis bytes-only (singlef.write(value)under the hood), andfs.open(..., 'wb').write(huge_bytes)would also buffer the whole payload locally before flushing —AbstractBufferedFileonly flushes a multipart part onceself.blocksizeis reached, so a single big.write()accumulates everything inself.bufferfirst.Fix
Client.upload()now accepts a bytes-like object or a binary readable stream:pipe_filepath, unchanged.fs.open(path, 'wb')viashutil.copyfileobj. The destinationAbstractBufferedFileflushes a multipart part everyblocksizebytes (e.g. 5 MiB on s3fs), so peak RAM is bounded by the backend's part size rather than the file size.File.save()opens the source file and streams it directly intoClient.upload(), eliminating the intermediate full-file bytes blob.HTTPClient.uploadsignature widened to match (stillNotImplementedError).Verification
tracemallocsmoke test, saving a 210 MB local file viaFile.save():read_bytes()+pipe_file)shutil.copyfileobj)Tests
test_upload_accepts_binary_streamintests/unit/test_client_local.pycovers the stream path.tests/unit/test_client_local.py(69 passed, 6 Windows-only skipped) continues to pass — the bytes path is unchanged.Compatibility
bytes→bytes | bytearray | memoryview | BinaryIO); existing callers passingbytesare unaffected.bytesand exercise thepipe_filefast path.