
fix(save): make isolated and atomic #1603

Open
shcheklein wants to merge 2 commits into main from fix-save-race-condition

Conversation

@shcheklein (Contributor)

Concurrent save() calls on the same dataset can corrupt data by double-inserting rows into the same version table.

The fix: UUID-based conflict detection in create_dataset_version (works across SQLite, PostgreSQL, and the Studio API), a retry loop for auto-versioned saves, and a staging-table pattern in save(): write to a temp table first, claim the version metadata, then rename.
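The staging-table flow described above can be sketched with plain sqlite3. This is a hypothetical illustration, not datachain's actual schema or API: rows land in a uniquely named temp table, the version is claimed in metadata, and only then is the table renamed to its final name, so concurrent readers never observe a partially written version table.

```python
import sqlite3

# Sketch of the "write to temp table, claim version, rename" pattern.
# All table and column names here are illustrative.
conn = sqlite3.connect(":memory:")

# 1. Write rows into a staging table with a unique temporary name.
conn.execute("CREATE TABLE ds_v1_tmp_abc (id INTEGER, val TEXT)")
conn.executemany("INSERT INTO ds_v1_tmp_abc VALUES (?, ?)", [(1, "a"), (2, "b")])

# 2. Claim the version in the metastore (elided); only the winning
#    writer proceeds to the rename below.

# 3. Rename the staging table to its final name in a single statement.
conn.execute("ALTER TABLE ds_v1_tmp_abc RENAME TO ds_v1")
print(conn.execute("SELECT COUNT(*) FROM ds_v1").fetchone()[0])  # 2
```

The rename is the commit point: before it, the final table name does not exist; after it, the full row set is visible at once.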

Copilot AI left a comment

Pull request overview

This PR aims to make concurrent save() operations on the same dataset isolated and effectively atomic to prevent version-table corruption and duplicate inserts under concurrency.

Changes:

  • Implemented a staging-table “write then claim version then rename” flow in DatasetQuery.save().
  • Added UUID-based conflict detection to metastore.create_dataset_version() and a retry loop for auto-versioned dataset creation in Catalog.create_dataset().
  • Updated functional tests to reflect the new create_dataset_version() return type and the fact that create_dataset() no longer creates a rows table.
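The UUID-based conflict detection plus retry loop from the changes list can be sketched as follows. This is a toy model, not the real metastore: a dict stands in for the versions table, and the first writer to record a (name, version) row owns it; a writer that finds a foreign UUID there bumps the version number and retries.

```python
import uuid

# Hypothetical sketch of UUID-based version claiming with a retry loop.
versions: dict[tuple[str, int], str] = {}

def create_dataset_version(name: str, version: int, job_uuid: str) -> bool:
    """Return True iff this writer successfully claims (name, version)."""
    owner = versions.setdefault((name, version), job_uuid)
    return owner == job_uuid  # False => another save already claimed it

def save_with_retry(name: str, job_uuid: str, max_retries: int = 10) -> int:
    version = 1
    for _ in range(max_retries):
        if create_dataset_version(name, version, job_uuid):
            return version
        version += 1  # conflict: try the next auto version
    raise RuntimeError("could not claim a dataset version")

a, b = str(uuid.uuid4()), str(uuid.uuid4())
print(save_with_retry("ds", a))  # 1
print(save_with_retry("ds", b))  # 2 (first attempt conflicts with a)
```

In the real metastore the claim would be an insert guarded by a uniqueness constraint rather than a dict, but the control flow (claim, detect foreign UUID, retry with the next version) is the same idea.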

Reviewed changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated 3 comments.

Summary per file:

  • tests/test_atomicity.py: Removes create_rows usage to align with new dataset creation semantics.
  • tests/func/test_metastore.py: Updates expectations for create_dataset_version() returning (ds, version_created).
  • tests/func/test_datasets.py: Refactors fixtures/tests away from create_rows and adds coverage for "no rows table created".
  • src/datachain/query/dataset.py: Implements staging-table save: temp table insert → claim version metadata → rename → finalize.
  • src/datachain/lib/dc/records.py: Creates the rows table explicitly since create_dataset() is now metadata-only.
  • src/datachain/data_storage/metastore.py: Returns (dataset, version_created) and uses UUID equality as conflict/ownership detection.
  • src/datachain/catalog/catalog.py: Removes create_rows behavior, adds version-claim retry logic, and changes create_dataset_version() to metadata-only + tuple return.


chain = dc.read_values(val=[1, 2, 3, 4], session=test_session).save(name)
catalog = test_session.catalog
return catalog.update_dataset(
    chain.dataset, {"description": "test dataset", "attrs": ["test", "dataset"]}

Copilot AI Feb 19, 2026


Catalog.update_dataset takes updates via keyword arguments; passing a dict as the second positional arg binds it to conn and results in no fields being updated. Unpack the dict (**{...}) or pass description= / attrs= directly so the fixture actually updates the dataset metadata.

Suggested change:
-    chain.dataset, {"description": "test dataset", "attrs": ["test", "dataset"]}
+    chain.dataset,
+    description="test dataset",
+    attrs=["test", "dataset"],

Comment on lines +140 to +143
# Create the rows table (create_dataset only creates metadata).
table_name = warehouse.dataset_table_name(dsr, dsr.latest_version)
warehouse.create_dataset_rows_table(table_name, columns=columns)


Copilot AI Feb 19, 2026


After create_dataset() was changed to create metadata only, this function now creates and populates the rows table but never calls catalog.update_dataset_version_with_warehouse_info(...). That leaves the version marked COMPLETE with num_objects/size/preview potentially unset; update the warehouse info before setting status COMPLETE (or explicitly document that read_records skips these fields).

Contributor


This comment makes sense to me; can you check? Also, why did we move creating the warehouse table outside the catalog? Now we need to do all of this (create the table, update stats, and maybe other things) in multiple places.

Contributor Author


I'll check; this is still kind of a draft and I'm iterating on it.

In short, the idea is to never create the warehouse table under its final name initially: create a temporary table and rename it at the end. The temporary table is created in a different place.

This read_records path is unique (and probably should be redesigned).

I'm still not sure if this design overall is right ...

@shcheklein force-pushed the fix-save-race-condition branch from b346f12 to 076e138 on February 19, 2026 02:18

cloudflare-workers-and-pages bot commented Feb 19, 2026

Deploying datachain with Cloudflare Pages

Latest commit: 68019c1
Status: ✅  Deploy successful!
Preview URL: https://a0a31138.datachain-2g6.pages.dev
Branch Preview URL: https://fix-save-race-condition.datachain-2g6.pages.dev


try:
    self.catalog.warehouse.rename_table(temp_table, final_table_name)
except Exception:
    with contextlib.suppress(Exception):
Contributor


Should we cleanup dataset metadata or mark it as failed here?
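One shape the cleanup asked about could take is sketched below. This is purely illustrative; FakeWarehouse, finalize, and the status dict are hypothetical stand-ins, not datachain's API. If the final rename fails, the staging table is dropped and the claimed version is marked failed so readers never see a half-created dataset.

```python
import contextlib

# Hypothetical sketch: clean up the staging table and mark the version
# FAILED when the final rename raises. All names are illustrative.
class FakeWarehouse:
    def __init__(self):
        self.tables = {"ds_tmp_abc"}
    def rename_table(self, old, new):
        raise RuntimeError("simulated rename failure")
    def drop_table(self, name):
        self.tables.discard(name)

def finalize(warehouse, status, temp_table, final_table):
    try:
        warehouse.rename_table(temp_table, final_table)
        status["state"] = "COMPLETE"
    except Exception:
        with contextlib.suppress(Exception):
            warehouse.drop_table(temp_table)   # remove the staging table
        status["state"] = "FAILED"             # don't leave it half-created
        raise

status = {"state": "CREATED"}
wh = FakeWarehouse()
try:
    finalize(wh, status, "ds_tmp_abc", "ds")
except RuntimeError:
    pass
print(status["state"], wh.tables)  # FAILED set()
```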


@shcheklein force-pushed the fix-save-race-condition branch from 076e138 to a3fea9d on February 19, 2026 23:54
@shcheklein force-pushed the fix-save-race-condition branch from a3fea9d to 68019c1 on February 20, 2026 02:43

codecov bot commented Feb 20, 2026

Codecov Report

❌ Patch coverage is 61.29032% with 24 lines in your changes missing coverage. Please review.

Files with missing lines:

  • src/datachain/catalog/catalog.py: 62.50% patch coverage (9 missing, 3 partials) ⚠️
  • src/datachain/query/dataset.py: 47.82% patch coverage (12 missing) ⚠️
